1.   INTRODUCTION

1.1  Improving  the  Locality  of  Programs  -  Previous  Work 

Since  the  early  years  of  modern  computing,  people  have  realized 
that  due  to  cost-speed  tradeoffs,  computer  memories  of  very  large  overall 
capacity  must  be  organized  hierarchically.   The  introduction  of  memory 
hierarchies  in  computer  systems  created  the  problem  of  storage  allocation 
of  programs.   At  each  moment  during  the  execution  of  a  program,  the 
distribution  of  its  information  (code  and  data)  among  the  levels  of  the 
memory  hierarchy  must  be  determined.   The  programmer  was  faced  with  the 
additional  responsibility  of  manually  solving  this  memory  allocation 
problem.   This  was  not  an  easy  thing  to  do,  especially  with  the  introduction 
of  high  level  languages  which  shielded  programmers  from  the  details  of 
machines.

The  idea  of  virtual  memory  systems  was  the  solution  to  this  problem. 
It  provided  an  elegant  way  of  achieving  automatic  storage  allocation 
[KILB62], [SAYR69]. Since the evolution of the virtual memory concept in
the early 1960s, a tremendous amount of research effort has gone into
investigating the various aspects of virtual memory systems. Different
methods  of  implementation  were  considered  and  contrasted:   segmentation, 
paging, or paged segmentation. Moreover, different memory management
algorithms were investigated. These are concerned with the fetch policy,
which decides when an item of virtual memory (a page or a segment) is to be
fetched to main memory; the placement policy, which decides where to place
an item in main memory; and the replacement rule, which decides which item
to replace if there is no space for the new item. Both fixed and variable
memory allotment policies were considered [BELA66], [DENN68], [CHU72]. People
have used the number of item faults, the efficiency of main memory
utilization, and the space-time product cost of a program to measure the
performance of different memory management schemes. Principles of optimality
have been defined in [BELA66], [PRIE76], and [BUDZ77]. The performance of
different policies was measured by comparison with the performance of optimal
policies. People often use reference string driven simulation techniques
for their statistical measurements of the effects of varying memory
allotment and page size on the performance of different policies. A survey of
the work done in this area and some results can be found in [DENN70] and
[KUCK70].

The  central  reason  behind  any  success  which  a  virtual  memory  system 
might achieve is the property of locality of reference which programs
exhibit. Denning in [DENN72a] makes the following three statements to
describe the locality of reference property of programs:

* During any time interval, a program distributes its
references nonuniformly over its address space, some
pages  being  favored  over  the  others. 

*  The  density  of  reference  to  a  given  page  changes  slowly 
in  time  or  the  set  of  favored  pages  changes  membership 
slowly. 

*  Two  disjoint  segments  of  the  page  reference  string  tend 
to be highly correlated when the interval between them
is short, and tend to become uncorrelated as the interval
between them increases.

It has been confirmed by early studies that the degree of locality of a
program is the most important factor in its cost of execution in a virtual
memory computer. Although one may not reduce the number of page faults
generated by a program by more than 30 or 40 percent by changing the page
replacement algorithm [BELA66], an improvement of a factor of 5 was achieved
by improving the locality of programs [COME67]. Thus it was recognized
that efforts should be directed toward developing techniques to improve the
locality of programs before executing them in virtual memory systems. This was
an absolute necessity for certain kinds of programs, namely those
processing large multi-page arrays.

There  are  two  approaches  to  the  problem  of  improving  the  locality 
of  reference  strings  generated  by  programs.   In  the  first  approach  the 
programmer  was  expected  to  follow  certain  rules  and  guidelines  when  coding 
the  solution  to  different  problems.   In  the  second  approach  people  tried 
to  devise  automatic  or  semi-automatic  locality  improvement  techniques. 
In  the  following  two  sections  we  will  discuss  briefly  the  previous  work 
done  in  these  two  areas.   We  will  give  illustrative  examples  and  sample 
results.   In  Section  1.2  we  will  point  out  the  deficiencies  and  problems 
in  the  previous  work,  present  our  philosophy  and  approach  to  the  problem, 
and finally sketch the outline of this thesis.

1.1.1  Programmer  Implemented  Locality  Improvement  Techniques

It  did  not  take  too  much  time  for  people  to  realize  that  virtual 
memory  computers  did  not  relieve  the  programmer  completely  from  worrying 
about the memory needs of a program. When programmers worked under the
assumption that in a virtual memory computer they could get all the memory
space they needed, the costs of running some programs were high [FINE66],
[BRAW68], [GLAS65].

Several papers have been published to give programmers rules and
guidelines for writing code to solve large problems in a virtual memory
computer. Some of these papers were oriented towards specific
applications and problems; others were of a more general nature. Examples of the
problem oriented work can be found in [BRAW70], [BOBR67], [DUBR72], and [ROGE73],
which treat sorting, list processing, solution of eigenvalue problems, and
the solution of linear equations, respectively. [McKE69], [MOLE72], and
[ELSH74]  are  examples  of  papers  which  address  the  general  problem  of 
algorithms  for  large  matrix  programs  in  a  paging  environment.   Moreover, 
manufacturers  of  virtual  memory  computer  systems  started  to  devote  sections 
of  manuals  to  help  programmers  develop  a  programming  style  for  virtual 
storage  systems  [IBM73]. 

A good representative of this approach to improving program locality
is the work of Elshoff in [ELSH74]. He was concerned with the processing
of multi-dimensional arrays in a paging environment. In particular, he
considered two-dimensional arrays which were assumed to be stored row-wise.
An $N \times N$ matrix satisfied the relation $N < Z < N^2$, where $Z$ is the
page size. Elshoff presented some rules to be used by programmers when writing
code to solve matrix problems. He applied his individual rules and their
combinations to two example programs, namely matrix transpose and matrix
multiplication. He also derived analytical expressions for the number of
page faults generated when executing under an LRU page replacement algorithm.
Moreover, he executed the original programs and the improved programs on a
dedicated machine. The matrices were square matrices of size 101x101, each
spanning 20 pages of virtual space with a page size of 512 words. Figure 1-a
and Table 1 show the results for the matrix transpose program. Figure 1-b
and Table 2 show the results for the multiplication program.
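
The effect of traversal order on a row-wise stored array can be illustrated
with a small simulation. The following Python sketch is not Elshoff's program
and does not reproduce his measurements. It only counts LRU page faults when
every element of a 101x101 matrix is touched once, first in row order and then
in column order, with 512-word pages. The allotment of 8 page frames and all
function names are assumptions made for this illustration.

```python
from collections import OrderedDict


def lru_faults(pages, frames):
    """Count page faults for a page reference sequence under LRU
    replacement with a fixed allotment of `frames` page frames."""
    resident = OrderedDict()          # least recently used page comes first
    faults = 0
    for page in pages:
        if page in resident:
            resident.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict least recently used
            resident[page] = None
    return faults


def page_of(i, j, n, page_size):
    """Page holding element (i, j) of an n x n array stored row-wise."""
    return (i * n + j) // page_size


def traversal(n, page_size, by_rows):
    """Page reference sequence for touching every element exactly once."""
    for outer in range(n):
        for inner in range(n):
            i, j = (outer, inner) if by_rows else (inner, outer)
            yield page_of(i, j, n, page_size)


N, Z = 101, 512      # matrix order and page size quoted in the text
FRAMES = 8           # hypothetical memory allotment, chosen for illustration

row_faults = lru_faults(traversal(N, Z, by_rows=True), FRAMES)
col_faults = lru_faults(traversal(N, Z, by_rows=False), FRAMES)
print(row_faults, col_faults)
```

Under these assumptions the row-order sweep faults once per page (20 faults),
while the column-order sweep faults on every page of every column (2020
faults) unless the allotment covers all 20 pages of the array.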

There are two very important conclusions which one can draw by
examining these figures and tables. The first is that programs which process
large arrays of data can have very serious problems if executed in virtual
memory computers. The second is that the amount of improvement which was
attained by the suggested techniques is very significant: the elapsed time
dropped from 77.5 to 11.0 for the transpose program and from 19460 to 252
for the multiplication program.

1.1.2   Automatic  or  Semi-Automatic  Locality  Improvement  Techniques

The main attractive feature of virtual memory systems is the
automatic management of memory allocation. Hence the approach presented in the
previous section seems to be a step backward, since the programmer is
required to follow certain rules while programming for a virtual memory computer.
Many  of  the  programming  guidelines  are  either  problem  oriented  or  cannot  be 
applied  in  simple,  direct  ways  to  complex  and  large  programs.   Hence  it 
seems  that  if  anything  is  to  be  done  to  programs  to  improve  their  locality 
properties,  it  should  be  taken  care  of  by  the  computer  software  system  and 
not  by  the  programmer. 

Table 1.  Results for the Matrix Transpose Program [ELSH74]
(Memory Allotment 15K)

Algorithm Used             Problem    System     Total   Elapsed       I/O
                               CPU       CPU       CPU      Time      Time

Standard                     0.819     9.900    10.719      77.5      66.8
Combination of All           1.110     1.408     2.518      11.0       8.5
Improvement Rules


Table 2.  Results for the Matrix Multiply Program [ELSH74]
(Memory Allotment 15K)

Algorithm Used             Problem    System     Total   Elapsed       I/O
                               CPU       CPU       CPU      Time      Time

Standard                     197.3    4493.9    4691.2     19460   14768.4
Combination of All           222.7       6.9     229.6       252      22.7
Improvement Rules


Several people took this approach [COME67], [HATF71], [MASU74], and
[FERR74]. All these researchers worked on what is called the 'pagination
problem'. A program has a number of modules: main procedure, subroutines,
and data blocks. Assuming that a page can hold more than one module, the
pagination problem can be simply stated as trying to group these modules
or blocks in pages such that the program generates a more local reference
string when executed in a virtual memory computer. Thus the aim is to
modify a program's layout in virtual space. This is called "program
restructuring." If the program's modules are relocatable with respect to
each other, this can be done by relinking the modules after changing the
order in which they are presented to the linker; otherwise changes in the
source code and recompilation of some modules might be needed.
Information about the dynamic behavior of the program is gathered during an
information gathering run. This information is used to construct a
restructuring non-directed graph for the program according to a particular
restructuring algorithm. The nodes of the graph represent the modules of
the program. The numerical labels of the edges represent the desirability
that the nodes they connect be laid out together within the same page.
After the restructuring graph is constructed, a clustering algorithm is used
to obtain the new layout for the program from the graph. The clustering
algorithm aims at "determining a linear arrangement of nodes (of the
restructuring graph) in pages which maximize the vicinity of those pairs
having the highest labels" [FERR76b].

The main difference between researchers in this area is the
restructuring algorithm they used. Hatfield and Gerald introduced the
nearness method for a restructuring algorithm [HATF71]. They argued that
performance can be improved if consecutive blocks or modules in the block
reference string generated by a program were grouped in the same page. Hence
the label $E_{ij}$ of the edge connecting nodes i and j in the restructuring
graph is incremented by one every time block i is referenced directly after j
or block j is referenced directly after i. In their extension to the nearness
method, Masuda, Shiota, Noguchi, and Ohki [MASU74] incremented $E_{ij}$ if
references to i and j are separated by some small distance in time. Ferrari in
[FERR74], [FERR75], [FERR76a], and [FERR76b] takes into explicit account the
memory management policy of the system when designing the restructuring
algorithm. He argues that each page replacement policy assumes a certain
model of the ideal program behavior, which is the behavior of a program
for which all the predictions made by the policy are correct. Hence a
program is restructured such that its behavior is as predictable by a certain
policy as possible. Thus he introduced different program tailoring
algorithms for different memory management policies: the critical LRU
restructuring algorithm for the LRU replacement policy, the critical
working set restructuring algorithm for the working set policy, and so on. In
the working set policy, for example, the block reference string and the
knowledge of the window size $T$ of the working set, $W_b(t,T)$, allow us to
identify the blocks which will be in memory at each reference of the string.
The critical working set tailoring (restructuring) algorithm increments by
1 all the labels of the edges (in the restructuring graph) which connect a
critically referenced block (a block which is not in $W_b(t,T)$) to all the
nodes of the members of $W_b$ at the time the critical reference is issued.
Ferrari  experimented  by  applying  his  algorithms  to  a  collection  of  programs. 
Some  of  his  experimental  results  are  shown  in  Table  3.   The  cost  of  the 
restructuring  algorithms  in  terms  of  computer  time  varies  roughly  linearly 
with  the  number  of  references  in  the  string  to  be  examined.   The  cost  of  the 
clustering  algorithm  (i.e.,  determining  which  nodes  of  the  restructuring 
graph  should  be  grouped  in  one  page)  increases  less  than  quadratically  with 
the  number  of  nodes  in  the  restructuring  graph.   One  notices  that  the  data 
collection  which  is  needed  for  restructuring  is  expensive  and  difficult  in 
today's systems. Restructuring was recently implemented on the SIRIS 8
operating system in France. A reduction of 40% to 70% in the page fault
rate was reported [BABO77].
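
The edge labeling rules described above can be sketched in a few lines of
Python. This is an illustrative reading of the nearness method and of the
critical working set rule, not the code of [HATF71] or of Ferrari. In
particular, we assume that the working set at time t consists of the distinct
blocks among the T references that precede t, and that a block referenced
twice in a row contributes no edge. The function names and the sample
reference string are hypothetical.

```python
from collections import defaultdict


def nearness_labels(blocks):
    """Nearness method: increment E_ij each time blocks i and j are
    referenced directly after one another (in either order)."""
    labels = defaultdict(int)
    for prev, cur in zip(blocks, blocks[1:]):
        if prev != cur:               # assumption: no self-edges
            labels[frozenset((prev, cur))] += 1
    return dict(labels)


def critical_ws_labels(blocks, window):
    """Critical working set rule: on a reference to a block outside the
    working set, increment the edges joining it to every member."""
    labels = defaultdict(int)
    for t, cur in enumerate(blocks):
        # Assumed definition: distinct blocks in the previous `window` references.
        working_set = set(blocks[max(0, t - window):t])
        if cur not in working_set:    # critical reference
            for member in working_set:
                labels[frozenset((cur, member))] += 1
    return dict(labels)


block_string = list("ABABCAC")        # hypothetical block reference string
near = nearness_labels(block_string)
crit = critical_ws_labels(block_string, window=2)
print(near)
print(crit)
```

Either dictionary plays the role of the labeled restructuring graph that a
clustering algorithm would then consume. The clustering step itself is not
sketched here.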

1.2  Problems  with  Previous  Work  and  Our  Approach 

It  is  clear  from  the  discussion  presented  in  the  previous  section 
that  the  locality  of  almost  any  program  can  be  improved  in  one  way  or 
another.   This  leads  to  the  conclusion  that  most  programs  are  not  naturally 
and  by  their  intrinsic  properties  well  suited  to  run  in  a  virtual  memory 
system.   In  fact  the  very  early  experimental  evaluation  studies  of  virtual 
memory  systems  did  point  out  that  if  these  systems  are  going  to  achieve 
an  excellent  level  of  performance  then  it  must  be  assumed  that  the  system 
software  of  the  machine  or  the  programmer  will  do  the  work  necessary  to 
adapt programs to virtual memory systems [FINE66]. These studies have
shown  that  programs  which  are  written  without  paying  any  attention  to  the 
paging  problem  tend  to  need  a  large  fraction  of  their  virtual  space  in  main 
memory  in  order  to  execute  efficiently.   This  reduces  the  effectiveness 
and  advantages  of  virtual  memory  systems. 

We  adopt  the  point  of  view  that  the  locality  improvement  work  should 
be  done  automatically  by  special  software  facilities  whether  these  are 
separate  from  the  rest  of  the  system  software  of  the  machine  or  integrated 
into  some  parts  of  it.   Thus  the  central  drawback  of  all  the  work  presented 
in  Section  1.1.1  is  that  it  puts  the  burden  of  locality  improvement  on  the 
programmer.   The  program  restructuring  approach  of  automatically  improving 
the  locality  of  programs  suffers  from  its  limited  scope  of  applicability. 
The  main  assumption  which  is  made  in  the  restructuring  approach  is  that  the 
individual  modules  of  a  program  are  smaller  than  the  page  size.   This  is 
true for code modules and data modules of programs handling small aggregates
of data like scalars or small arrays. It is not true, however, of many
practical  programs.   The  size  of  a  data  block  can  easily  exceed  the  size 
of  a  page.   For  example  the  size  of  a  32x32  double  precision  matrix  is  8 
kilobytes  which  makes  2  pages  of  the  IBM  370/158  virtual  memory  space, 
and  arrays  are  often  much  larger  than  32x32.   There  are  numerous  scientific 
application  programs  in  which  tens  of  large  arrays  are  used.   Elshoff's 
measurements  give  a  hint  of  the  very  poor  performance  which  will  result 
if these programs are run on virtual memory computers. The problem will
be much worse in the future, because as the CPU speed grows from one
computer model to its successor, people will refine the models they use
in their programs to exploit the additional computational power
available. This will substantially increase the array sizes used in programs.
One  can  argue  that  main  memory  is  getting  cheaper  every  day,  machines  will 
have  more  memory  attached  to  them,  and  hence  the  problem  will  not  be  so 
bad.   The  counter  argument  is  that  however  cheap  memory  and  I/O  devices 
are  going  to  become,  they  will  remain  the  most  expensive  parts  of  a  computer 
system.   So  the  question  we  face  is  one  of  cost-effectiveness. 

We  summarize  the  previous  discussion  as  follows.   Examining  the 
amount  of  improvement  which  Elshoff  was  able  to  get,  one  concludes  that 
virtual  memory  computers  cannot  survive  without  doing  something  to  the  kind 
of programs which Elshoff worked with. Moreover, it is clear that the
restructuring approach will not help these programs much. The data blocks
in such programs are simply much larger than a page, and thus the
problem of which blocks should be grouped in the same page is not the core
problem here.

The purpose of this thesis is to provide algorithms for automatic
locality improvement techniques which can be used in an optimizing compiler
when compiling programs with large (multi-page) arrays. In Chapter Two we
will  explore  some  of  the  theoretical  fundamentals  behind  this  subject.   We 
will  discuss  concepts  like  performance  measurement  criteria  and  modeling 
of  program  behavior.   In  Chapter  Three  we  will  present  our  transformation 
algorithms.   In  Chapter  Four  we  discuss  some  experiments  which  we  performed 
on  a  collection  of  Fortran  programs.   In  these  experiments  we  evaluated 
the  amount  of  improvement  which  was  achieved  by  applying  our  transformations 
to  these  programs  using  the  LRU  and  the  working  set  memory  management 
policies.   We  show  that  the  amount  of  improvement  achieved  is  comparable  to 
the  improvement  achieved  by  Elshoff  by  his  programmer  implemented  techniques 
[ELSH74].   In  several  of  our  programs  we  encountered  the  working  set 
anomalies as described in [FRAN78]. We have done some experiments to
investigate this anomalous behavior of the working set policy. We also did
some  experiments  which  are  related  to  the  problem  of  modeling  of  program 
behavior.   We  conclude  this  thesis  in  Chapter  Five  by  pointing  out  some 
interesting problems for future research.

2.   FUNDAMENTAL  CONCEPTS


In  this  thesis  we  are  concerned  with  the  paging  problem 
of  scientific  programs  which  handle  large  aggregates  of  data 
in  the  form  of  vectors  or  multi-dimensional  arrays.   Due  to  the 
nature  of  such  programs,  their  paging  activities  will  be  mainly 
dominated  by  data  paging  of  arrays.   Hence,  in  our  study  we  will 
ignore  memory  references  to  scalar  variables.  Moreover,  we 
will  ignore  memory  references  to  instructions.   Although  we 
believe  that  our  locality  improvement  techniques  will  also 
improve  the  locality  of  references  to  code  pages,  we  will  simplify 
our  discussion  by  separating  code  and  data  pages  and  concentrate 
on  analyzing  data  paging.   Since  data  paging  dominates  the  I/O 
activity  of  the  type  of  programs  we  are  interested  in,  ignoring 
code  paging  does  not  affect  the  accuracy  of  our  results  in  any 
significant  way. 

We  start  this  chapter  in  Section  2.1  by  a  brief  discussion 
of  performance  measurement  criteria  of  paged  virtual  memory 
systems.   In  Section  2.2,  we  will  address  the  modeling  problem 
of  program  behavior.   A  survey  of  the  previous  work  is  presented 
in  Sections  2.2.1  and  2.2.2.   Traditionally,  people  were  concerned 
with  modeling  the  locality  property  of  reference  strings.   In 
Section  2.2.3,  we  will  present  our  own  different  point  of  view. 
We  will  be  concerned  with  identifying  localities  at  the  source 
program  level.   All  the  properties  of  reference  strings  can  be 
attributed  to  source  program  structures.   In  Section  2.2.3.1,  we 
develop the elementary loop model (ELM) of program localities. We
will present examples of loops which follow this model. In
Section 2.2.3.2, however, we show examples of loops which cannot
be modeled by this model. Nevertheless, such loops can be
transformed such that they will follow the ELM. The required
transformations are part of those discussed in Chapter 3.

2.1  Criteria  for  Performance  Evaluation

Since  the  purpose  of  our  work  is  to  improve  the  locality  of 
programs,  we  need  to  define  some  measurement  tools  to  be  used  for 
the  evaluation  of  the  degree  of  locality  of  programs.   Several  of 
these  tools  can  be  defined.   There  are  two  main  categories  of  these 
measurement  criteria. 

In the first category, one measures the intrinsic
characteristics of a program, irrespective of the type of machine environment
where this program is to run. In other words, the characteristics of
program locality intervals are the criteria to be used. Although
there has been no general agreement on the definition and the method
of isolation of localities of a program, the important characteristics
of these localities which determine the cost of running the program in
different environments can easily be recognized. The first
characteristic is the amount of memory required by each locality. The second
is the length of time the program will stay in this locality. These
characteristics are called the size of the locality set of pages and
the duration of the locality interval. Thus, to compare two programs,
one says that the program with the smaller locality sets and the longer
locality intervals is the better program. Moreover, the manner in which the
program moves from one locality to another is important. More I/O activity
will be generated when adjacent localities have very few common pages.

Another  way  of  measuring  the  locality  of  a  program  is 
by  measuring  the  cost  of  its  execution  in  a  virtual  memory 
computer. Here one needs to differentiate between monoprogrammed,
dedicated systems and multiprogrammed, general-purpose systems.

In  the  monoprogrammed  case,  the  program  is  allocated  all 
the  users'  primary  memory  space  of  the  machine.   Thus,  the  cost 
of  running  the  program  is  proportional  to  the  time  it  spends 
in  the  system,  or  its  turnaround  time.   This  in  turn  is 
dependent  on  the  amount  of  time  the  CPU  is  used  and  the  amount 
of  time  the  I/O  channels  are  utilized.   In  simple  words,  the 
turnaround  time  is  given  by  the  equation  (assuming  paging  is 
done  on  demand  only) 

    Turnaround time = CPU time + I/O time.

The I/O time is totally dependent on the degree of locality of
the program. The CPU time is mainly dependent on the amount of
calculation performed by the program. The effectiveness of any
technique intended to improve the locality of a given program
is measured by the ratio of the turnaround time of the original
program to that of the transformed program. If the transformation
technique  does  not  change  the  CPU  time  in  any  significant  way, 
then  the  ratio  of  the  I/O  time  of  the  original  program  to  that 
of  the  transformed  program  is  the  measure.   This  ratio  also 
reflects  the  improvement  in  the  throughput  of  a  monoprogrammed 
system.   Hence,  the  better  the  locality  of  programs,  the  higher 
the  throughput  of  a  monoprogrammed  system. 
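
As a worked illustration of this measure, the following Python sketch applies
the turnaround equation to the Total CPU and I/O times quoted from [ELSH74]
in Table 1 for the matrix transpose program. The helper names are ours, and
the sketch assumes, as the equation does, that CPU and I/O time do not
overlap. The source gives no time unit for the table, so only the
dimensionless ratios are meaningful.

```python
def turnaround(cpu_time, io_time):
    """Turnaround time of a monoprogrammed run under demand paging."""
    return cpu_time + io_time


def improvement(original, transformed):
    """Ratio of a cost of the original program to that of the transformed one."""
    return original / transformed


# Total CPU and I/O times for the matrix transpose program (Table 1, [ELSH74]).
standard = {"cpu": 10.719, "io": 66.8}
improved = {"cpu": 2.518, "io": 8.5}

turnaround_ratio = improvement(
    turnaround(standard["cpu"], standard["io"]),
    turnaround(improved["cpu"], improved["io"]),
)
io_ratio = improvement(standard["io"], improved["io"])
print(round(turnaround_ratio, 2), round(io_ratio, 2))
```

In this example the CPU time itself changes considerably (from 10.719 to
2.518), so the turnaround ratio of about 7.0 is the appropriate measure. The
I/O ratio of about 7.9 would be the measure only if the CPU time had remained
essentially unchanged.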

The analysis of multiprogrammed systems is more complex.
We must make some assumptions in order to use the cost of
execution  of  programs  in  a  multiprogramming  environment  as  a 
measure  of  their  degree  of  locality.   On  an  abstract  level, 
one  can  say  that  in  a  multiprogrammed  system  there  are  three 
resources:   CPU  bandwidth,  main  memory  bandwidth,  and  I/O 
bandwidth.   The  CPU  bandwidth  reflects  the  computational 
capability  of  the  system,  the  main  memory  bandwidth  reflects 
the  size  and  speed  of  the  main  memory,  and  the  I/O  bandwidth 
reflects  similarly  the  size  and  speed  of  I/O  devices  and 
peripherals.   In  order  to  use  the  cost  of  execution  as  a  degree 
of  locality  measure,  we  must  assume  that  the  system  is  totally 
saturated.   In  other  words,  the  CPU,  main  memory,  and  I/O 
channels  are  100%  utilized.   If  this  is  the  case,  then  a 
program  will  be  using  the  CPU  with  a  portion  of  its  virtual 
space  present  in  main  memory.   The  rest  of  the  main  memory  will 
be  occupied  by  other  programs  that  are  doing  I/O  or  waiting 
for  I/O  or  CPU  service.   When  the  running  program  references 
a  page  which  is  not  in  main  memory,  it  loses  the  CPU  to  another 
program  and  T  time  units  will  pass  before  it  gets  hold  of  the 
CPU  once  more.   T,  the  reactivation  time,  is  the  sum  of  the  I/O 
time  and  the  system  overhead  time  necessary  to  service  a  page 
fault.   A  program  will  be  charged  by  the  system  as  long  as  it 
is  occupying  some  part  of  the  main  memory,  whether  it  is  using 
the  CPU  or  not.   Hence,  the  cost  of  executing  a  program  under 
such  conditions  is  proportional  to  the  time  integral  of  the 
main  memory  space  it  is  using  at  any  instant  of  time  over  its 
total  life  time  in  main  memory.   This  integral  is  called  the 
space-time cost.

A multiprogrammed system can use one of several memory
management policies. If the local LRU replacement algorithm is
used, the program will be assigned a fixed number of page
frames all through its lifetime in main memory. When a page
fault occurs, only one of the program's own pages will be
replaced. Thus, the space-time product cost for a fixed memory
allotment is given by:

    $\text{space-time cost} = m \, (T \cdot PF + t_p)$ page frame-seconds,

where

    $m$   = number of page frames allocated,
    $T$   = average reactivation time in seconds,
    $PF$  = number of page faults during the program's lifetime,
    $t_p$ = the time period in which the CPU was used by the program.

If a variable memory allocation policy is used, like the page
fault frequency replacement algorithm [CHU72] or the working
set policy [DENN68], then the program will go through a sequence
of states $S_1, S_2, \ldots, S_i, \ldots, S_n$. The program will stay for
$t_i$ seconds in state $S_i$ and will have $m_i$ main memory page frames
assigned to it during $S_i$. Hence, for variable memory allotment
we have:

    $\text{space-time cost} = \sum_i m_i \, t_i$, for all $i$.

One  can  measure  the  degree  of  locality  of  a  program  by  the 
inverse  of  its  space-time  cost.   Hence,  another  way  of  measuring 
the  effectiveness  of  a  transformation  technique  in  improving  the 
locality  of  a  program  is  by  measuring  the  ratio  of  the  space-time 
cost of the original program to that of the transformed program.
This will also be proportional to the improvement in the
throughput of the system. If the reduction of the space-time
cost  is  accompanied  by  a  reduction  in  the  average  number  of  page 
frames  assigned  to  the  program  during  its  lifetime,  then  an 
improvement  of  the  degree  of  multiprogramming  will  also  be 
achieved. 
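
The two space-time cost expressions of this section can be written down
directly. The Python sketch below is only an illustration: the function
names are ours, and every numeric value (frame counts, reactivation time,
fault counts, CPU time, state durations) is hypothetical rather than a
measurement.

```python
def fixed_space_time(m, T, PF, tp):
    """Space-time cost, in page frame-seconds, for a fixed allotment of m
    frames: the program holds m frames while it computes (tp seconds) and
    while it waits T seconds after each of its PF page faults."""
    return m * (T * PF + tp)


def variable_space_time(states):
    """Space-time cost for a variable allotment, given (m_i, t_i) pairs:
    m_i frames held for t_i seconds in state S_i."""
    return sum(m_i * t_i for m_i, t_i in states)


# Hypothetical original and transformed runs under a fixed allotment.
original = fixed_space_time(m=10, T=0.03, PF=200, tp=5.0)
transformed = fixed_space_time(m=10, T=0.03, PF=40, tp=5.5)
ratio = original / transformed        # improvement in space-time cost

# Hypothetical sequence of states under a variable allotment policy.
variable = variable_space_time([(4, 2.0), (10, 1.5), (6, 3.0)])
print(original, transformed, ratio, variable)
```

A fixed allotment is the special case of a single state that lasts
T*PF + tp seconds, so `variable_space_time([(m, T * PF + tp)])` agrees with
`fixed_space_time(m, T, PF, tp)`.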

We  will  use  all  these  different  criteria  to  measure  the 
improvement  of  the  behavior  of  programs  in  different  environments. 
For  monoprogrammed  systems  we  will  use  the  number  of  page  faults 
generated  as  a  function  of  memory  allotment.   For  multiprogrammed 
systems,  we  will  use  the  space-time  cost  as  a  function  of  average 
memory allotment. We will also consider the intrinsic
characteristics of program localities; namely, the size of the locality
set,  its  lifetime,  and  the  transition  behavior  from  one  locality 
set  to  another.   In  the  next  section  we  will  clarify  the  concept 
of  locality  and  define  the  characteristics  of  a  locality  from 
the  source  code  structure  of  the  program. 

2.2  Modeling  Program  Behavior 

Although  the  term  "program  behavior"  has  broad  implications, 
it  is  usually  used  to  mean  the  behavior  of  page  references  of 
programs.   Here  we  will  also  restrict  ourselves  to  this  specific 
aspect  of  program  behavior.   The  page  referencing  behavior  is 
very  important  in  all  computer  systems  analysis  and  simulation 
studies.   There  have  been  two  methods  of  analysis  of  computer 
systems;  namely,  mathematical  queuing  models  and  simulation 
models. In the mathematical models, people have used
simplified and inaccurate models of program behavior. In simulation
studies, people use traces of actual programs to drive
their models. There are several drawbacks to the use of reference
strings in simulation studies. Because it is very expensive to
generate reference strings, people experiment with a very small
number of programs, in most cases about five.
These programs may not be truly representative of typical programs.
In many cases, it is difficult to extrapolate the behavior of
the experimental programs to similar programs. Another drawback
of reference strings is that they may contain more detail than
is necessary for accurate system modeling.

Thus,  there  is  an  obvious  need  for  accurate  models  of 
program  behavior.   These  models  will  replace  real  program  traces 
in  simulation  studies.   They  will  be  used  to  generate  reference 
strings for these studies. A model can generate a reference string
of any desired length. Another
advantage of a model over an actual reference string is that it can
be used in analytic studies, while a reference string cannot. Moreover,
unlike a reference string, a model does not need a large amount of
storage space.

There  have  been  several  efforts  to  develop  such  models.   All 
people  who  worked  in  this  area  looked  upon  the  reference  strings 
generated  by  programs  as  the  observed  phenomenon  to  be  modeled.   The 
property  of  concern  of  these  strings  is  the  locality  property.   Two 
types  of  models  have  been  suggested:   stochastic  models  and 
deterministic  models.   We  discuss  the  previous  work  in  these  two 
areas  in  the  following  two  sections. 

2.2.1   Previous Work - Stochastic Models

Different stochastic models have been proposed. A detailed
discussion of these models can be found in Spirn's book [SPIR77].
The most important model is the LRU stack model and its extensions
[DENN72b], [ARVI73], and [SHED72]. For a reference string
$r_1, r_2, \ldots, r_t, \ldots$, at any time t the LRU stack is an ordered
vector $P(t) = (P_1(t), P_2(t), \ldots, P_i(t), \ldots, P_n(t))$, where
n is the number of pages in the program and $P_i(t)$ is the identifier
of the ith most recently referenced page at time t. For the
reference string $r_1, r_2, \ldots, r_k$ there is a corresponding
distance string $d_1, d_2, \ldots, d_k$. If
$P(t-1) = (P_1(t-1), P_2(t-1), \ldots, P_i(t-1), \ldots, P_n(t-1))$
and $r_t = P_i(t-1)$, then $d_t = i$. In other
words, $r_t$ is at distance $d_t$ in $P(t-1)$. In the simple LRU model
each distance is assigned a probability:

$$\Pr[d_t = i] = a_i , \qquad 1 \le i \le n.$$

In order for the LRU model to exhibit the locality property, it
should satisfy the condition:

$$a_1 \ge a_2 \ge \cdots \ge a_n .$$

This locality condition has been shown to be approximately true
for real programs [DENN72b]. The distance probabilities can be
determined from measurements on real programs. In the distance
string $d_1, d_2, \ldots, d_k$ corresponding to a reference string of
a program one can count the number of occurrences of a certain
distance i; then

$$\hat{a}_i = \text{maximum likelihood estimate of } a_i = \frac{\text{number of occurrences of distance } i}{k} .$$

The problem with this method is its expense and that there is no
obvious way of "perturbing" these measurements to model other
strings. Empirically it was found that approximations to the $a_i$'s
can be derived from Belady's lifetime function [BELA69]:

$$A_i = a_1 + a_2 + \cdots + a_i \approx 1 - c\, i^{-k}, \qquad 1 \le i \le n, \quad 1 \le k \le 3.$$
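
The measurement procedure described above (forming the distance string and taking the maximum likelihood estimates) can be sketched as follows. This is only an illustrative sketch, not code from the cited work; the function names are hypothetical, and the treatment of first references (no finite distance, but still counted in the string length k) is an assumption, since the text does not say how they are handled.

```python
from collections import Counter


def distance_string(refs):
    """Return the LRU stack distance of each reference.

    A first reference to a page has no finite stack distance; it is
    reported as None (an assumption made for this sketch).
    """
    stack = []       # stack[0] is the most recently referenced page
    distances = []
    for page in refs:
        if page in stack:
            d = stack.index(page) + 1   # 1-based position in P(t-1)
            stack.remove(page)
        else:
            d = None
        stack.insert(0, page)           # referenced page moves to the top
        distances.append(d)
    return distances


def estimate_distance_probabilities(refs, n):
    """Estimate a_i as (number of occurrences of distance i) / k."""
    distances = distance_string(refs)
    k = len(distances)
    counts = Counter(d for d in distances if d is not None)
    return [counts.get(i, 0) / k for i in range(1, n + 1)]
```

For example, `distance_string(list("abcab"))` gives three first references followed by two references at distance 3. The cost noted in the text is visible here: every reference requires a search of the stack.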

Although  the  simple  LRU  model  did  produce  good  predictions  of 
the  average  working  set  size  and  page  fault  rate  of  some  real 
programs  in  [DENN72b],  it  fails  to  predict  all  aspects  of  realistic 
program  behavior.   For  example,  in  real  programs  page  faults  tend  to 
occur  in  clusters.   This  happens  when  a  program  enters  a  new  phase  of 
execution.   The  LRU  model  does  not  predict  this  clustering  effect 
[DENN75]. In a memory of m page frames, the probability of a page
fault under the LRU replacement algorithm is given by

$$\Pr[r_t \notin \{P_1(t-1), \ldots, P_m(t-1)\}] = a_{m+1} + a_{m+2} + \cdots + a_n = 1 - A_m .$$

This is a constant probability all through the execution of the
program.   The  time  until  the  next  page  fault  is  not  affected  by  the 
number  of  faults  that  occurred  recently. 
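
The constant fault probability can be checked numerically with a small sketch of the simple LRU stack model. The code below is an illustration under stated assumptions, not an implementation from the literature: the function names are hypothetical, the initial stack order is arbitrary, and memory is assumed to hold the top m stack positions from the start, so that a reference faults exactly when its distance exceeds m.

```python
import random


def fault_probability(a, m):
    """Model fault probability a_{m+1} + ... + a_n = 1 - A_m for m frames."""
    return sum(a[m:])


def generate_reference_string(a, length, seed=0):
    """Generate references and their distances from the simple LRU stack model."""
    rng = random.Random(seed)
    n = len(a)
    stack = list(range(n))       # arbitrary initial stack; page ids are 0..n-1
    refs, distances = [], []
    for _ in range(length):
        # Draw a distance d with probability a_d, independently of history.
        d = rng.choices(range(1, n + 1), weights=a)[0]
        page = stack.pop(d - 1)  # the page at stack position d is referenced
        stack.insert(0, page)    # and moves to the top of the stack
        refs.append(page)
        distances.append(d)
    return refs, distances


def lru_fault_rate(distances, m):
    """Fraction of references whose stack distance exceeds m."""
    return sum(1 for d in distances if d > m) / len(distances)
```

Because each distance is drawn independently of the past, the generated string shows no clustering of faults, which is the shortcoming described above.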

The  simple  LRU  model  suffers  from  another  problem.   It  can  be 
shown that for programs following the LRU stack model, the page fault
rate under static LRU is lower than that for a dynamic algorithm with
the same average size [SPIR77]. This contradicts the experimental
evidence available in the literature that, for example, the working set
algorithm performs better than LRU. The LRU model assumes that the
size  of  the  locality  set  is  fixed  while  in  real  programs  it  is  varying. 

There have been several attempts to improve the simple LRU
model. To account for clustering of page faults during phase
transitions, the distance probabilities must be allowed to vary
in time. Thus we will have

$$a_{1,t} \ge a_{2,t} \ge \cdots \ge a_{n,t}$$

for all t, but in general $a_{i,t} \ne a_{i,t+1}$. In a simplified analysis
one  would  assume  that  there  are  two  distributions  of  the  distance 
probabilities.   One  represents  the  intraphase  behavior  and  is 
biased toward the top of the stack. The second corresponds
to  phase  transition  behavior  and  is  biased  toward  larger  stack 
distances.   A  two-state  Markov  chain  can  be  used  to  choose 
between  the  distance  distributions.   This  is  shown  in  Figure  2. 
In  state  1  the  intraphase  distribution  is  used.   In  state  2  the 
phase  transition  distribution  is  used.   1-p  is  the  probability 
of  making  a  phase  change  and  p  is  the  probability  of  staying  in 
the  same  phase.  p>>q  because  programs  do  not  spend  much  virtual 
time  in  phase  transitions.   Although  this  two-distribution  model 
exhibits  the  clustering  of  page  faults  and  phase  transition 
phenomenon,  it  does  not  allow  for  changes  in  a  program's  locality 
set  size.   This  requires  a  distribution  for  each  locality  set 
size  and  possibly  more  than  one  distribution  to  model  phase 
transitions.   The  multiple-distribution  model  is  complicated, 
impractical,  and  attempts  at  validating  it  have  been  unsuccessful. 
Other Markovian models have been discussed in the literature
[SPIR77] and [SHED72]. There are several problems with many of
them. Mainly these problems are validation, complexity, and
practicality problems. The stochastic approach to modeling program
behavior seems to go in a vicious circle. If the proposed model is
simple and practical, it is not accurate. On the other hand, if more
accuracy is incorporated in a model, it becomes complex, impractical,
and difficult to validate. We choose to end the discussion of
stochastic models at this point and refer the reader interested in
more details to [SPIR77].

[Figure 2. Two-State Markov Chain; the transitions are labeled 1 - p and 1 - q]
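
The two-distribution model described earlier in this section can be sketched as a generator of stack distances. This is a minimal sketch under assumptions: the function name is hypothetical, p is taken as the probability of remaining in state 1 as stated in the text, and q is assumed to be the probability of remaining in state 2 (the text does not define q explicitly; Figure 2 only labels a transition 1 - q).

```python
import random


def two_distribution_distances(intraphase, transition, p, q, length, seed=0):
    """Return a list of (state, distance) pairs.

    intraphase, transition: distance weights used in state 1 and state 2.
    p: probability of remaining in state 1 (intraphase behavior).
    q: assumed probability of remaining in state 2 (phase transition).
    """
    rng = random.Random(seed)
    n = len(intraphase)
    state = 1
    out = []
    for _ in range(length):
        weights = intraphase if state == 1 else transition
        d = rng.choices(range(1, n + 1), weights=weights)[0]
        out.append((state, d))
        stay = p if state == 1 else q
        if rng.random() >= stay:          # leave the current state
            state = 2 if state == 1 else 1
    return out
```

With p close to 1 and q small, long runs of small distances are interrupted by short bursts of large distances, which is the clustering of page faults that the simple model lacks. The locality set size still does not vary, as noted in the text.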

2.2.2  Previous  Work  -  Deterministic  Models 

As  was  mentioned  previously,  the  locality  property  is  the  central 
property  of  reference  strings  which  everybody  is  trying  to  model.   In 
all  the  literature  dealing  with  stochastic  models  of  program  behavior, 
people  talk  about  the  locality  property  in  a  vague  manner.   People 
argue  that  at  any  moment  of  time  t,  there  exists  a  set  of  favored 
pages  which  the  program  tends  to  reference  for  a  long  period  of  time. 
This  set  is  called  the  locality  set  and  the  time  which  the  program 
spends  referencing  its  member  pages  is  called  the  residence  time  in  the 
particular  locality  set  [DENN72a].   Thus  the  program  will  go  through  a 
sequence of states $S_1, S_2, \ldots, S_i, \ldots, S_k$ during its execution. A
sequence of 2-tuples $(L_1, T_1), (L_2, T_2), \ldots, (L_i, T_i), \ldots, (L_k, T_k)$
is associated with the sequence of states. In state $S_i$ the program
references the locality set of pages $L_i$ for a duration of $T_i$. People
who worked on the development of the LRU stack model assumed that a
program has n locality sets at any time, n being the depth of the
stack. The $\ell$th locality set consists of the $\ell$ most recently used
pages, $1 \le \ell \le n$. "The true, or favored, locality set will then be
the smallest set whose retention in memory leads to an acceptably
low page fault rate" [SPIR76]. However, no method is provided to
isolate one of the n localities as being the true locality set.

The work of Batson and Madison [BATS76a], [BATS76b], [BATS76c]
is the only attempt found in the literature to date to provide a formal
definition of a locality set and a method to isolate locality sets
in a reference string. To cure the deficiencies of the simple LRU
model, Batson and Madison extended the LRU stack to include two new
ordered vectors. Thus, at each moment of time t, three ordered
vectors are kept to describe the state of the reference string:

$$P(t) = (P_1(t), P_2(t), \ldots, P_i(t), \ldots, P_n(t));$$

$$\sigma(t) = (\sigma_1(t), \sigma_2(t), \ldots, \sigma_i(t), \ldots, \sigma_n(t));$$

$$T(t) = (T_1(t), T_2(t), \ldots, T_i(t), \ldots, T_n(t)).$$

P(t) is the LRU stack of segment identifiers* as defined earlier.
$\sigma_i(t)$ is the time at which the segment in the i-th stack position
was last referenced. $T_i(t)$ is the time at which a reference was
last made to a stack position greater than i. In other words, $T_i(t)$
is the time after which the i top positions of the stack were occupied
by members of $S_i(t)$. $S_i(t)$ is the set of the i most recently referenced
segments. At each time t, there is a hierarchy of sets
$S(t) = (S_1(t), S_2(t), \ldots, S_i(t), \ldots, S_n(t))$. In this hierarchy
$S_i(t) \subset S_{i+1}(t)$.
$T_i(t)$ can be described as the formation time of $S_i(t)$. Figure 3 shows
a reference string and its $P(30)$, $\sigma(30)$, and $T(30)$ [BATS76a].

An activity set at time t is any set of segments in the LRU
hierarchy in which every member of that set has been re-referenced
since the set was formed. In terms of the $\sigma(t)$ and $T(t)$ stacks, an
activity set at time t, $A_i(t)$, is any $S_i(t)$ for which $\sigma_i(t) > T_i(t)$.
At each instant during program execution, zero or more activity sets
are recognized at various levels of the LRU hierarchy. Moreover,
when a reference is made to any segment which is below a particular
activity set in the LRU stack, then this activity set (and any set above
it) is terminated.

* Batson and Madison studied only segmented virtual memory systems.
We will discuss the implications of this limitation later.

[Figure 3-a. A Reference String, Its LRU Stack, and BLI's [BATS76a]]

[Figure 3-b. The P, S, σ, and T Vectors at t = 30 for the String in Figure 3-a [BATS76a]]
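
The definitions of the three vectors and of an activity set can be made concrete with a short sketch. This is not the update algorithm published in [BATS76a]; it is a direct, unoptimized reading of the definitions given above, with hypothetical function names. One assumption is labeled in the code: a first reference to a segment is treated as a reference below the whole current stack, so every set in the hierarchy, including the newly created largest one, is given formation time t.

```python
def lru_state(refs):
    """Return (P, sigma, T) after processing refs, with time starting at 1.

    P[i]     : segment in stack position i+1 (P[0] is the most recent).
    sigma[i] : time the segment in position i+1 was last referenced.
    T[i]     : time of the last reference to a position greater than i+1,
               i.e. the formation time of the set of the i+1 top segments.
    """
    P, sigma, T = [], [], []
    for t, seg in enumerate(refs, start=1):
        if seg in P:
            d = P.index(seg) + 1      # stack distance of this reference
            del P[d - 1]
            del sigma[d - 1]
        else:
            d = len(P) + 1            # assumption: new segment lies below the stack
            T.append(t)               # the new largest set is formed now
        P.insert(0, seg)
        sigma.insert(0, t)
        for i in range(d - 1):        # sets smaller than d are re-formed at t
            T[i] = t
    return P, sigma, T


def activity_levels(sigma, T):
    """Levels i for which S_i(t) is an activity set: sigma_i(t) > T_i(t)."""
    return [i + 1 for i in range(len(sigma)) if sigma[i] > T[i]]
```

For the string A B A B, the set {A, B} is formed at time 2 and both members have been re-referenced by time 4, so level 2 is an activity set. A following reference to a new segment C terminates it, since all formation times are reset.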

A  bounded  locality  interval  (BLI)  is  defined  as  the  2-tuple 
consisting  of  an  activity  set  and  its  lifetime  or  residence  at  the 
top  of  the  stack.   In  Figure  3,  the  BLI's  of  the  example  reference 
string  are  shown  [BATS76a].   Notice  the  hierarchical  structure 
of the BLI's. In [BATS76a] algorithms are given to update the
$P(t)$, $\sigma(t)$, and $T(t)$ stacks. Also experimental results concerning
the characteristics of the BLI's are presented in [BATS76a] and
[BATS76c]. In Chapter 4 of this thesis, we will discuss Batson's
experimental results and the validity of their implications. We
have implemented Batson's algorithms and applied them to our collection
of Fortran programs. We have correlated the BLI's with the syntactic
structure of programs and found several problems with the concept of
bounded locality intervals. Some of these are:

1. As is mentioned in [BATS76a], the way BLI's are defined
leads to identifying a tremendous number of very short BLI's.
These BLI's carry no indication of locality and have no significance. They
only add undue expense to generating the experimental data. Figure 4
shows a real example taken from one of our programs. Only references
to array elements are considered. Also every array is identified
with only one segment. There is a one-to-one correspondence
between array names and segment names. The level one BLI which is
the true locality interval generated by the looping structure is of
duration 258. However, every time each statement in the loop is
executed, a level two BLI is generated with a duration of 2 references.

         DO 15 KK = 1,KMAX
         FD(KK) = FD(KK) + 273
         FE(KK) = FE(KK) + 1.E-3
         TTA(KK) = TTA(KK) + 273
   15    QW1(KK) = QW1(KK) + 1.E-3

Figure 4-a. An Example Loop

[Figure 4-b. The BLI's Generated by the Program in Figure 4-a]

2. Long lived BLI's can be generated which give a misleading
indication of locality. Figure 5-a shows another real
example from one of our programs which illustrates this situation.
In Figure 5-b, we show the structure of the generated BLI's. In the
first loop the arrays DZ, PO, QW1, TTA, RHO, FD, and FE are referenced.
A level one BLI of duration 293 references and size 7 will be
generated and it reflects a true locality interval because of the
loop. In the following loop the referenced arrays are PI, QVS, HU1,
FO, RHO1, QW1, PO, and TTA. In Figure 5-b, we notice that there are
four BLI's covering the execution of the second loop. The BLI which
is the true reflection of the second loop is the level 4 BLI. The
BLI's at levels 1, 2, and 3 are meaningless and give a false indication of
localities. Each of these contains some array names which are common
to loops 1 and 3 and were never referenced in loop 2. Note that in
Figure 4 the true BLI is the level 1 BLI but in Figure 5 the true
BLI reflecting the second loop is the level 4 BLI. Thus there is no
general rule which can be used to locate the BLI reflecting the real
locality just by examining the BLI's generated from the trace of a
program. We will elaborate later on the confusion which the hierarchical
structure of BLI's creates.


         DO 1 I = 2,KMAX
         A1 = DZ(I)
         PO(I) = QW1(I) + TTA(I) + PO(I)
    1    RHO(I) = FD(I) + FE(I) + PO(I)

         DO 2 I = 1,KLES
         PO(I) = PO(I) + 5
         PI(I) = PO(I)/P
         FO(I) = PI(I)*2
         TTA(I) = TTA(I)/PI(I)
         QVS(I) = PO(I) * 3
         HU1(I) = QW1(I)/QVS(I)
         IF (HU1(I) .GE. .4) GO TO 3
         HU1(I) = .4
         QW1(I) = QVS(I) * .4
    3    RHO1(I) = PO(I)/QW1(I)
    2    Continue

         DO 4 I = 2,KLES
         A1 = RHO(I)
         FD(I) = TTA(I) + 2
         FE(I) = QW1(I) + 1
         A2 = TTA(I) * 3
         BA(I) = RHO1(I) * XJA(I)/DZ(I)
         BB(I) = RHO1(I) + TTA(I)/(DZ(I)-5)
    4    Continue

Figure 5-a. An Example Program

The problem which is illustrated by the example in Figure 5
could be cured if the definition of an activity set were modified.
If an activity set were defined as any set of segments of the LRU
hierarchy in which every member of that set has been re-referenced
k times since that set was formed, k > 1, then we would have only
one BLI covering the execution of loop 2 in the example of
Figure 5. This modification would also reduce the number of very
short BLI's. Although Batson mentions in [BATS76b] that Peter
Denning suggested this modification of the definition of an
activity set to him, he did not modify the definition. The
modification would increase the complexity and the expense of
finding the BLI's in real traces of programs. Moreover, it is
not obvious how one should choose k. More importantly,
this suggestion does not really solve the problem of the
confusion in interpreting the hierarchical structure of the BLI's.
This is illustrated in the next example.

3. BLI's have an inconsistent correlation to the syntactic
structure of programs. For example, the existence of a hierarchy
of BLI's is a necessary but not sufficient condition for the
existence of a nested loop in the source program. A nested loop
will generate a multilevel hierarchical BLI structure. The
existence of a multilevel BLI structure, however, can be due to
other reasons. In Figure 6-a, the loop is doubly nested. This
loop generates a two-level BLI structure. In the first loop of
Figure 6-b, the arrays A, B, C, D, and E are referenced. A subset
of this array set, namely A, B, C, and D, is referenced in the
second loop. The array E is referenced in the third consecutive
loop. For this situation, we also have a hierarchical BLI
structure. This structure is misleading. The first loop is not
reflected in any BLI. The level one BLI hints at the existence
of a locality of size 5 during the execution of the three loops.
This is of course not true. In this situation, we really have
three localities. The first one is of size 5, its members are
A, B, C, D, E, and it covers only the first loop. This is
followed by a locality of size 4, its members are A, B, C, D,
and it covers the second loop. The last locality is of size 1,
it contains the E array, and it covers the third loop. Denning's
suggestion will not change the problem with the BLI's in this
example.

         DO 1 I = 1,100
         A(I) = B(I) * C(I)
         DO 1 J = 1,100
         D(I,J) = D(I,J) * A(I)
    1    Continue

    level 1:   ((A,B,C,D);30300)
    level 2:   ((A,D);300)  ((A,D);300)  ...  ((A,D);300)

Figure 6-a. A Doubly Nested Loop and Its BLI's

         DO 10 I = 1,100
         A(I) = B(I) * C(I)
         D(I) = B(I) * E(I)
         C(I) = E(I) ** 2
   10    Continue

         DO 20 I = 1,100
         B(I) = A(I) - C(I) * D(I)
   20    Continue

         DO 30 I = 1,100
         E(I) = 0
   30    Continue

    level 1:   ((A,B,C,D,E);1300)
    level 2:   ((A,B,C,D);400)  (E;100)

Figure 6-b. Consecutive Loops and Their BLI's

4. There is no simple, obvious way of isolating the major
phases of execution of a program from its BLI's. In other words,
it is not obvious how to get the sequence of 2-tuples $(L_1,T_1)$,
$(L_2,T_2)$, ..., $(L_i,T_i)$, ... for a program from its BLI's.

In  [BATS76a],  level  one  BLI's  of  10  milliseconds  or  greater 
duration  are  taken  to  be  the  major  phases  of  execution.   Our 
examples  in  Figures  5  and  6-b  illustrate  situations  where  level 
one  BLI's  give  erroneous  information.   In  Figure  6-a,  the  program 
spends  most  of  its  time  referencing  arrays  of  level  2  BLI's.   To 
avoid  these  problems,  a  procedure  is  suggested  in  [BATS76b]  to 
determine a pathway through the BLI hierarchy such that the space-time
cost of executing the given program is minimized. The BLI's
which are included in this pathway are taken to define the major
phases of execution. The procedure suggested in [BATS76b] does not
really  minimize  the  space-time  cost.   The  correct  algorithms  for 
minimizing  the  space-time  cost  of  running  a  program  were  developed 
by  Budzinski  in  [BUDZ77].   These  algorithms  are  complex  and  expensive. 
Moreover,  the  localities  of  a  program  are  supposed  to  be  machine 
independent  while  in  the  approaches  of  [BATS76b]  and  [BUDZ77]  the 
minimum  space-time  product  is  dependent  on  machine  parameters  such 
as  the  mean  time  needed  to  transfer  a  segment  (or  a  page)  from 
secondary  to  primary  storage. 

From  the  previous  discussion,  it  is  clear  that  the  locality 
sets  isolation  problem  has  not  really  been  solved.   In  the  next 
section  we  present  our  different  approach  and  solution  to  the 
problems  presented  in  Sections  2.2.1  and  2.2.2. 

2.2.3  Our Approach - Analysis of Program Behavior at the Symbolic Level

We  think  that  there  are  two  main  reasons  for  the  difficulties 
which  people  faced  when  trying  to  come  up  with  satisfactory  models  of 
program  behavior.   The  first  reason  is  due  to  the  approach  taken  in 
attacking  this  problem.   Traditionally,  people  took  reference  strings 
generated  by  programs  to  be  the  observed  phenomenon  of  interest. 
Thus,  for  them  a  program  serves  only  the  purpose  of  generating  a 
reference  string  and  then  it  can  be  ignored.   However,  the  center  of 
concern  should  really  be  the  program  itself  and  not  the  reference 
string. There is almost nothing important in a reference string which
is not reflected in the source program. Thus, our approach will be
to study and analyze programs at the source level. Although the
complexity of programs was probably the main reason why people avoided
studying programs at the source level, one can overcome this difficulty
by recognizing that scientific programs have few basic structures.
One can start by studying the simplest structure and then
move to more elaborate ones. As it turns out, a clear understanding
of simple structures can be extended rather easily to more complex ones.
For more discussion about program analysis at the source level see
[BATS76c].

The second reason for the difficulties of modeling program behavior
is due to the programs themselves. As we will demonstrate later
in  this  chapter,  programs  as  written  by  people  do  not  behave  well  in  a 
paging  environment. 

We  will  adopt  the  following  strategy  in  our  study.   First,  we 
will  develop  a  model  for  an  ideal  program.   In  developing  this  model  we 
will  discuss  the  important  characteristics  of  such  an  ideal  program. 
Next  we  show  that  it  is  possible  to  find  some  programs  in  the  real  world 
which follow this model. However, we will give examples of other programs
which, as written by people, do not follow this model. In Chapter
3  we  develop  automatic  transformation  algorithms  which  can  be  used  to 
force  most  programs  to  follow  this  model.   Moreover,  these  transformations 
reduce  the  cost  of  execution  of  programs  in  virtual  memory  computers. 
Thus  the  transformations  make  programs  behave  better  (they  will  be  easier 
to  model  and  manage)  and  cost  less  to  execute. 

In our study we will separate data and code pages. For programs
with large data aggregates, code paging is trivial compared to
data paging. We are mainly concerned with the data paging problem.
Moreover, we will ignore references to scalars. These same assumptions
were made in [BATS76a]-[BATS76c]. Most scientific production programs
are  written  in  Fortran.  Moreover,  there  is  a  good  reason  to  believe 
that  versions  of  Fortran  will  continue  to  evolve  and  exist  for  a  long 
time  to  come.  Hence,  without  a  loss  of  generality,  we  will  use  examples 
of  Fortran-like  programs  and  structures.  All  through  this  thesis  we 
assume  that  paging  will  be  made  on  demand.   In  other  words  we  assume 
that  there  is  no  overlap  between  the  CPU  and  I/O  activities  of  the  same 
program. 

In  Section  2.2.3.1  we  develop  our  model  of  the  program  with  the 
ideal  behavior.   In  the  same  section  we  define  elementary  loops  and 
show  that  such  loops  follow  the  model  of  the  ideal  program.   Hence,  we 
will  call  our  model  the  elementary  loop  model  (ELM).   In  Section  2.2.3.2 
we will give examples of programs which do not follow the ELM. In
the  same  section  we  will  mention  those  transformations  of  Chapter  3  which 
will  cure  specific  problems  with  different  programs. 

2.2.3.1  The  Elementary  Loop  Model 

What  is  the  ideal  behavior  of  a  program  in  a  paged  system? 
Ideally  a  program  will  need  only  a  small  fraction  of  its  virtual  space 
to  be  present  in  main  memory.   With  this  little  memory  allotment,  the 
mean  time  between  page  faults,  MTBPF,  will  be  large.   Moreover,  the 
program  will  make  effective  use  of  the  main  memory  page  frames  allotted 
to it. Thus, the density of reference to each page will be high. In
other words the mean time between references to each page, MTBR, will
be small. Moreover, the page faulting activity will be clustered. This
leads  to  rather  long  periods  of  useful  CPU  activity  which  are  interrupt- 
free.   This  has  an  important  effect  in  multiprogrammed  systems.   If 
programs  have  cyclic  behavior  in  which  they  go  through  alternating 
periods  of  clustered  I/O  and  CPU  activities  then  the  scheduling  and 
other  problems  become  much  easier.   The  OS  CPU  time  will  be  decreased. 

The  description  given  in  the  previous  paragraph  is  that  of  a 
program which can be modeled by the ideal program model. Let us now
define one kind of loop which follows this model.

Definition 1: An elementary loop is an ordered set of assignment statements
preceded by one DO control statement. The variables referenced
in the loop are one-dimensional arrays and possibly scalars. The subscripts
of the array variables are linear functions of the index variable.
In the subscript expressions, all the index variables have the
same coefficient.

As  an  example  of  the  behavior  of  an  elementary  loop  let  us 
discuss  the  behavior  of  the  following  program. 

Program  1. 

         DO S3 I = 1, N
   S1    A(I) = 2*I+3
   S2    C(I) = B(I)**2-4*C(I)
   S3    D(I) = C(I)/A(I)

Let Z be the number of words in a page, N>>Z, and $K = \lceil N/Z \rceil$. There are
four arrays referenced in Program 1: A, B, C, and D. Each array
occupies K pages of virtual space. Let us denote the ith page of A
by a(i). Thus A will span the virtual pages a(1), a(2), ..., a(i), ...,
a(K). Similar notation will be used for the pages of the other arrays.
The  total  virtual  space  of  these  arrays  is  4*K  pages.   In  a  non-virtual 
memory  computer  this  program  will  need  4*K  pages  of  main  memory  to  run. 
If  this  amount  of  main  memory  is  not  available,  the  programmer  must  take 
care  of  transferring  parts  of  his  arrays  between  secondary  and  main 
memory  such  that  the  program  will  run  in  less  than  the  total  virtual 
space.   In  a  virtual  memory  computer,  however,  the  operating  system  will 
automatically  take  care  of  this  problem.   The  operating  system  need  only 
assign  4  pages  to  this  program  and  the  program  will  run  in  an  optimum 
way  under  demand  paging.   It  will  have  the  minimum  number  of  page 
faults,  or  I/O  transfers  between  secondary  and  main  memory.   Moreover, 
its space-time cost will be minimum. With 4 pages of main memory, the
program will have 4 page faults when it starts execution in order to
allocate a(1), b(1), c(1), and d(1). After this burst of I/O activity
the loop will go through Z iterations without any I/O interrupts. The
I/O interrupt-free CPU activity will last for 7*Z memory references,
7 being the number of array memory references per iteration of the loop.
During the CPU activity period the MTBR to the pages of the program will
be $\le 7$ references. Thus the density of reference to these pages is high.
Another burst or cluster of I/O activity will follow to allocate a(2),
b(2),  c(2),  d(2)  in  main  memory.   In  the  next  burst  of  CPU  activity  the 
loop  index  will  go  from  Z  +  1  to  2*Z  and  the  duration  of  this  CPU  burst 
will  be  another  7*Z  references.   This  oscillation  or  cycling  between 
bursts of I/O and CPU activity will continue through the lifetime of
this program. In the Ith cycle the pages a(I), b(I), c(I), and d(I)
will be allocated and then processed. The I/O burst time will be 4*T,
T being the average time of servicing a page fault (measured in memory
references), and the duration of the CPU burst will be 7*Z references.
The cycle time, $T_c$, will be 4*T + 7*Z references. Thus the mean time
between the clusters of page faults is large, 4*T+7*Z. This behavior
will  be  the  same  for  the  LRU,  FIFO,  or  MIN  replacement  algorithms.   The 
total  number  of  page  faults  will  be  4*K  and  the  total  space-time  cost 
will be 4*K(4*T + 7*Z). This is a well behaved program. In a
multiprogramming system, programs of this type will make the best use of
the  system.   I/O  and  CPU  bursts  of  different  programs  can  be  overlapped 
such  that  the  I/O  and  CPU  utilization  will  be  maximized.   The  memory 
space  will  be  saturated  with  different  parts  of  different  programs  to 
maximize  throughput.   Such  programs  will  run  efficiently  in  virtual 
memory  computers. 

To achieve this performance, Program 1 needs 4 pages of
main memory. If 3 or fewer pages are assigned to it, we will have one
or more page faults per iteration. The number of page faults will be
very large, O(N) instead of O(K), where N is the number of words in an
array while K is the number of pages spanned by the array. In addition
to the large increase in I/O activity, the program will lose the nice
property of clustered page faults or bursts of I/O activity. The useful
CPU activity will be constantly interrupted by page faults. The
performance of the virtual memory system will collapse under such
conditions.

For every elementary loop, there is a critical memory allotment
which is needed in order to avoid performance collapse. In the case of
Program 1 this number is 4. In general we will denote this number by $m_o$.
The behavior of Program 1 and similar programs can be nicely modeled by
the sequence of the 2-tuples:

$(L_1, T_1), (L_2, T_2), \ldots, (L_i, T_i), \ldots, (L_k, T_k)$

$L_i$ = the ith locality set of pages
$T_i$ = the residence time in this locality set of pages.

For Program 1, $L_i = \{a(i), b(i), c(i), d(i)\}$ and $T_i$ = 4*T + 7*Z.
The size of $L_i$, $|L_i|$, is equal to $m_o$, which is constant at 4 for
all i. Moreover, $T_i$ is the same for all i and is equal to the cycle
time, $T_c$, as discussed previously. Note that the phases of execution
of this program have been easily identified.
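The effect of the memory allotment on the number of page faults can be checked with a small simulation. The following is only an illustrative sketch: it assumes a hypothetical elementary loop that references one word of each of four arrays per iteration (Program 1 itself makes 7 array references per iteration), and the names `elementary_loop_refs` and `lru_faults` are introduced here for illustration.

```python
from collections import OrderedDict

def elementary_loop_refs(arrays, K, Z):
    """Yield (array, page) references of a loop sweeping K pages of Z words."""
    for i in range(K * Z):            # i is the (0-based) loop index
        for name in arrays:           # assumed: one reference per array per iteration
            yield (name, i // Z)      # word i of every array lies on page i // Z

def lru_faults(refs, frames):
    """Count page faults of a reference string under LRU with `frames` frames."""
    memory = OrderedDict()            # keys ordered from least to most recently used
    faults = 0
    for page in refs:
        if page in memory:
            memory.move_to_end(page)
        else:
            faults += 1
            if len(memory) == frames:
                memory.popitem(last=False)   # evict the least recently used page
            memory[page] = True
    return faults

K, Z = 3, 8                           # small hypothetical sizes
arrays = ("a", "b", "c", "d")
faults_at_critical = lru_faults(elementary_loop_refs(arrays, K, Z), frames=4)
faults_below = lru_faults(elementary_loop_refs(arrays, K, Z), frames=3)
print(faults_at_critical, faults_below)
```

Under these assumptions the count with 4 frames is 4*K, one fault per page, while with 3 frames every reference faults, so the count grows with N = K*Z rather than with K.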

Since  the  behavior  of  an  elementary  loop  follows  precisely  the 
model  of  the  ideal  program,  we  will  denote  the  ideal  program  model  by 
the elementary loop model (ELM). Note that the ELM and an elementary
loop  are  two  different  things.   An  elementary  loop  was  defined  in 
Definition  1.   The  ELM  is  the  model  of  the  ideal  program.   An  elementary 
loop  is  an  ideal  program  and  it  can  be  modeled  using  the  ELM.   Other 
loops,  however,  can  also  be  modeled  by  the  ELM.   The  following  are  the 
necessary  conditions  which  must  hold  for  a  given  loop  so  that  it  can 
be  modeled  by  the  ELM: 

The Critical Memory Allotment,

$m_o$ = O(# of different array names in the loop);    (2.1)

The Cycle Time,

$T_c = O(R_\ell * c + m_o * T)$, where    (2.2)

$R_\ell$ = # of occurrences of array names in the loop,
c = integer constant (# of iterations per cycle);

Mean Virtual Time Between Clusters of Page Faults,

MTBPF = $O(R_\ell * c)$;    (2.3)

Mean Virtual Time Between References to a page,

MTBR = $O(R_\ell)$.    (2.4)

Equations 2.1-2.4 are the definition of the ELM.
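As a worked check, the ELM parameters can be evaluated for Program 1, for which the text gives 4 array names, 7 array references per iteration, and Z iterations per cycle. The sketch below is illustrative only: the class name `ElmParameters`, the numeric values of Z, T, and K, and the choice of a constant factor of 1 in the order-of-magnitude expressions are assumptions made here.

```python
from dataclasses import dataclass

@dataclass
class ElmParameters:
    m_o: int       # critical memory allotment (number of different array names)
    r_l: int       # array references per iteration of the loop
    c: int         # iterations per cycle
    t_fault: int   # average page fault service time T, in memory references

    def cycle_time(self):
        # Equation 2.2 with constant factor 1: R_l*c + m_o*T
        return self.r_l * self.c + self.m_o * self.t_fault

    def mtbpf(self):
        # Equation 2.3: virtual time between clusters of page faults
        return self.r_l * self.c

    def space_time_cost(self, pages_per_array):
        # m_o page frames are held for one cycle per page of each array
        return self.cycle_time() * self.m_o * pages_per_array

Z, T, K = 512, 1000, 10        # hypothetical page size, fault time, pages per array
program1 = ElmParameters(m_o=4, r_l=7, c=Z, t_fault=T)
print(program1.cycle_time(), program1.mtbpf(), program1.space_time_cost(K))
```

The cycle time and space-time cost computed this way agree with the expressions 4*T + 7*Z and 4*K(4*T + 7*Z) derived for Program 1.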

Before proceeding any further, let us generalize an observation
which we made concerning the execution of Program 1 to all elementary
loops.

Theorem 1:   Given an elementary loop L, let

$m_o$ = the number of different array names referenced in the loop.
$R_\ell$ = the number of array references per iteration of the loop.
T = the average page fault service time.
K = the number of pages spanned by each array referenced in the loop.

With $m_o$ page frames, the cost of executing the loop will be the same
whether the replacement algorithm used is the LRU, FIFO, or Belady's
MIN algorithm. The cycle time is given by:

$T_c = R_\ell * c + m_o * T$, where c = Z/(the coefficient of the
index variable in the subscript expressions of the array variables).

The space-time cost is given by:

$ST = T_c * m_o * K$

Proof:   When the execution of an elementary loop is started, $m_o$
different pages will be referenced. If $m_o$ page frames are allotted
to the loop, all three replacement algorithms will allocate these
page frames to the first locality set of pages. In other words, the
pages referenced in the first cycle of execution will be allocated
space in main memory.

From our previous discussion in this section, the loop will
have a cyclic behavior. We will use induction to prove our theorem.
First, we show that the three replacement algorithms replace the set
of pages referenced in the first cycle by those referenced in the second
cycle. Second, given that the pages referenced in the (I-1)th cycle
will be in memory when the Ith cycle is started, we will show that the
three algorithms will replace these pages by the pages referenced in
the Ith cycle.

When the first references to the pages of the second cycle are
made, the MIN algorithm will replace pages referenced in the first cycle.
This is because the forward distance of all these pages is infinite.
Similarly, the LRU and FIFO algorithms will replace pages of the first
cycle, though not necessarily in the same order. Thus, all these
algorithms will produce $m_o$ page faults and the second cycle time duration
will be $R_\ell * c + m_o * T$. Note that c is the number of loop iterations
per cycle.

If we assume that the pages of the (I-1)th cycle will be in
memory when the execution of the Ith cycle starts, then by a similar
argument to the one presented in the previous paragraph, we conclude
that the three algorithms will replace the pages of the (I-1)th cycle
by those of the Ith cycle. Hence, in general the cycle time is given
by $R_\ell * c + m_o * T$, and the total space-time cost is given by
$(R_\ell * c + m_o * T) * m_o * K$.

Q.E.D.
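The claim of Theorem 1 can be illustrated, though of course not proved, by simulating the three replacement algorithms on the reference string of a small elementary loop. The sketch below assumes a hypothetical loop over three arrays with one reference to each per iteration; the function name `simulate` and the sizes K and Z are chosen here for illustration only.

```python
def simulate(refs, frames, policy):
    """Count page faults for `policy` in {"LRU", "FIFO", "MIN"}."""
    refs = list(refs)
    memory = []                       # resident pages; front of the list is evicted first
    faults = 0
    for t, page in enumerate(refs):
        if page in memory:
            if policy == "LRU":       # a hit refreshes recency only under LRU
                memory.remove(page)
                memory.append(page)
            continue
        faults += 1
        if len(memory) == frames:
            if policy == "MIN":
                future = refs[t + 1:]
                # evict the page whose next reference is farthest away (or never comes)
                victim = max(memory, key=lambda p: future.index(p)
                             if p in future else len(future))
            else:
                victim = memory[0]    # LRU: least recently used; FIFO: oldest arrival
            memory.remove(victim)
        memory.append(page)
    return faults

def elementary_loop(arrays, K, Z):
    for i in range(K * Z):
        for name in arrays:
            yield (name, i // Z)      # page holding word i of the array

K, Z = 4, 6                           # hypothetical sizes
arrays = ("b", "c", "a")              # m_o = 3 different array names
counts = {p: simulate(elementary_loop(arrays, K, Z), len(arrays), p)
          for p in ("LRU", "FIFO", "MIN")}
print(counts)
```

With $m_o$ = 3 frames, each policy incurs 3 faults per cycle in this example, for a total of 3*K.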

The important point which Theorem 1 makes is that the performance
of elementary loops will not be affected by the replacement
algorithm used. It is totally determined by the amount of memory allotted.
Note that Theorem 1 does not hold for the least frequently used replacement
algorithm.

Although elementary loops do occur in real programs, more complex
loops will very often be encountered. Some
of these can still be modeled by the ELM. Others, however, cannot.
In the next section we will discuss some examples. In Chapter 3 we will
present two types of compile time optimizing transformations. The first
type will be used to force any loop to behave such that it can be
modeled using the ELM. The second type will be used to improve the cost
of execution of loops; namely, to reduce the value of $m_o$, the number
of I/O transfers, and the space-time cost.

2.2.3.2   Other  Loops 

In this section we show examples of loops which are not elementary.
Our examples fall into three categories. In the first category,
the loops are not elementary but their behavior follows the ELM. Moreover,
Theorem 1 holds for these loops. In the second category, the behavior
of the loops follows the ELM but Theorem 1 does not hold. The
behavior of such loops is asymptotic to the behavior of elementary loops
and they do not really have serious problems. In the third category
the loops do not follow the ELM and their problems are serious. In
Chapter 3 an effort will be made to design transformations to cure the
problems of such loops. A loop can be in one of these categories for
different reasons. In what follows we give examples of these different
reasons.

(i)  Multi-dimensional arrays in loops.

The  existence  of  large  multi-dimensional  arrays  in  a  loop  can 
easily  cause  problems  in  a  virtual  memory  system.   Let  us  first  give 
an  example  in  which  multidimensional  arrays  cause  no  problem  and  the 
behavior  of  the  program  can  still  be  modeled  by  the  ELM.   Consider  the 
following loop:

Program 2.    DO S1 I = 1,N
              DO S1 J = 1,N
         S1   A(J,I) = B(J,I) + C(J,I)


In Fortran two-dimensional arrays are stored column-wise. In
all our examples and analysis, we will consider large arrays which
satisfy the condition $N \le Z < N^2$. If each of the arrays of Program 2
spans K pages, then a close examination of the program will show that it
can indeed be modeled by the ELM. For Program 2 the MTBR is $O(R_\ell)$ and

$m_o$ = # different array names = 3

MTBPF = $T_c = Z * R_\ell + T * m_o$ = Z*3 + T*3

ST = space-time cost/cycle = 3 * (3*Z + T*3)

The following program, however, cannot be modeled by the ELM:

Program 3.    DO S1 I = 1,N
              DO S1 J = 1,N
         S1   A(I,J) = B(I,J) * C(I,J)

To  make  the  analysis  simple  let  N  =  Z.   We  will  make  this 
assumption  through  the  rest  of  this  chapter.   Each  column  of  a  matrix 
will span one page. The number of different array names here is still
3. With three page frames, however, three page faults will be generated
per  iteration  of  the  inner-most  loop.   There  is  no  clustering  of  page 
faults,  i.e.  CPU  and  I/O  activities  will  be  interleaved.   Consequently, 
the  system  will  suffer  performance  collapse.   This  loop  needs  all  its 
virtual  space  to  be  allotted  in  main  memory  in  order  to  generate  the 
minimum  number  of  page  faults  and  to  minimize  its  space-time  cost. 
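The contrast between Program 2 and Program 3 can be reproduced with a short simulation. This is a sketch under the assumptions stated in the text (column-wise storage, N = Z so that each column is one page, 3 page frames) plus an assumed LRU policy and an assumed reference order B, C, A within each iteration; the function names are introduced here.

```python
from collections import OrderedDict

def lru_faults(refs, frames):
    """Count page faults of a reference string under LRU."""
    memory = OrderedDict()
    faults = 0
    for page in refs:
        if page in memory:
            memory.move_to_end(page)
        else:
            faults += 1
            if len(memory) == frames:
                memory.popitem(last=False)   # evict the least recently used page
            memory[page] = True
    return faults

def loop_refs(n, inner_walks_column):
    """Page references of A(.,.) = B(.,.) op C(.,.) when column j is page j."""
    for outer in range(n):
        for inner in range(n):
            # Program 2 subscripts (J,I): the inner loop stays within one column.
            # Program 3 subscripts (I,J): the inner loop moves to a new column each time.
            column = outer if inner_walks_column else inner
            for name in ("B", "C", "A"):
                yield (name, column)

N = 8                                         # hypothetical size, with N = Z
program2_faults = lru_faults(loop_refs(N, True), frames=3)
program3_faults = lru_faults(loop_refs(N, False), frames=3)
print(program2_faults, program3_faults)
```

In this setting Program 2 faults 3 times per column (3*N in total), whereas Program 3 faults on every reference (3*N*N in total).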

The reason behind the difficulty with Program 3 is that the
array elements are not being referenced in the order in which they are
stored. If Program 3 were written in PL1, in which multi-dimensional
arrays are stored row-wise, the problem would disappear. In PL1,
however, Program 2 will have a problem. Thus it is obvious that for
multi-dimensional arrays, the storage scheme and the pattern of reference are
important in determining the behavior of a loop. This is what all of
Elshoff's paper was about [ELSH74]: matching the pattern of reference to
the storage scheme. In [McKE69] three storage schemes for
multi-dimensional arrays were compared: row-wise, column-wise, and submatrix
storage. If $RZ = \sqrt{Z}$, then in the submatrix storage scheme an (n x n)
two-dimensional matrix will be divided into square submatrices of size
(RZ x RZ) as shown in Figure 7. If $N = \lceil n/RZ \rceil$ then there will be $N^2$
of these submatrices. Each submatrix is stored in a page. An
m-dimensional array with the dimensions $D_1 \times D_2 \times D_3 \times \ldots \times D_m$ will be
stored in $D_3 \times D_4 \times \ldots \times D_m$ planes. Each plane will contain $D_1 \times D_2$
array elements. There will be $\lceil D_1/RZ \rceil$ rows of pages and $\lceil D_2/RZ \rceil$ columns
of pages in each plane. Hence each plane will have $\lceil D_1/RZ \rceil * \lceil D_2/RZ \rceil$ pages.
The element of the array with the subscripts $d_1, d_2, \ldots, d_m$ will
belong to page number

$$(\lceil d_1/RZ \rceil - 1) * \lceil D_2/RZ \rceil + \lceil d_2/RZ \rceil + (d_3 - 1) * \lceil D_1/RZ \rceil * \lceil D_2/RZ \rceil + (d_4 - 1) * \lceil D_1/RZ \rceil * \lceil D_2/RZ \rceil * D_3 + \ldots + (d_m - 1) * \lceil D_1/RZ \rceil * \lceil D_2/RZ \rceil * D_3 * D_4 * \ldots * D_{m-1}.$$
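The page-number expression for the submatrix scheme can be written as a short function. The sketch below follows the expression as read here (pages numbered from 1, row by row within a plane, planes ordered with the third subscript varying fastest); the function names `ceil_div` and `submatrix_page` are introduced for illustration.

```python
def ceil_div(a, b):
    """Ceiling of a/b for positive integers."""
    return -(-a // b)

def submatrix_page(subscripts, dims, rz):
    """1-based page number of the element with 1-based `subscripts`."""
    d1, d2, *higher = subscripts
    D1, D2, *higher_dims = dims
    page_rows = ceil_div(D1, rz)          # rows of pages in each plane
    page_cols = ceil_div(D2, rz)          # columns of pages in each plane
    # position of the page inside its plane
    page = (ceil_div(d1, rz) - 1) * page_cols + ceil_div(d2, rz)
    # number of whole planes that precede the plane of this element
    planes_before, stride = 0, 1
    for d, D in zip(higher, higher_dims):
        planes_before += (d - 1) * stride
        stride *= D
    return page + planes_before * page_rows * page_cols

# A 4 x 4 matrix with RZ = 2 occupies four pages, numbered row by row.
print([submatrix_page((i, j), (4, 4), 2) for i in (1, 3) for j in (1, 3)])
```

For two-dimensional arrays the loop over higher dimensions is empty and the result reduces to the first two terms of the expression.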

In  [McKE69]  it  is  shown  that  matrix  algorithms  can  be  designed 
such  that  with  the  submatrix  storage  scheme,  enormous  reduction  in  the 
number  of  page  faults  relative  to  row-wise  storage  can  be  achieved. 

Figure 7.   A Two-Dimensional Array Stored by the Submatrix Scheme
(the array is divided into RZ x RZ submatrices, one per page, numbered
row by row: Page-1, Page-2, ..., Page-N, Page-N+1, ..., Page-N^2)

With 3 page frames and the submatrix storage scheme, Program 3
will have 3 page faults every RZ iterations of the inner-most loop.
The duration of the interrupt-free CPU activity will be 3*RZ. This is
not as good as the performance in Program 2 where the CPU burst time was
3*Z references long. Moreover, we still cannot use the ELM to model the
behavior of Program 3 even if the submatrix storage scheme is used to
store the arrays. The problem here is that we will not reference all
the elements involved in the calculation of each page while the page is
in main memory. In Program 3 all the Z elements of a page will be
referenced in the calculation while only RZ elements will be referenced
every time the page is in main memory. Thus a given page will be
transferred RZ times between secondary and main memory. In effect what
we are saying is that although the MTBPF for Program 3 is better with
submatrix storage as compared with column storage (3*RZ compared to 3)
it is still not as good as it is for Program 2 (3*Z).
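The statement that each page is transferred RZ times can be checked by simulation. The sketch below assumes N = Z = RZ*RZ as in the text, an LRU policy with 3 page frames, and a reference order B, C, A within each iteration; the function names are hypothetical.

```python
from collections import Counter, OrderedDict

def lru_fault_trace(refs, frames):
    """Return the list of pages that faulted under LRU with `frames` frames."""
    memory, faulted = OrderedDict(), []
    for page in refs:
        if page in memory:
            memory.move_to_end(page)
            continue
        faulted.append(page)
        if len(memory) == frames:
            memory.popitem(last=False)    # evict the least recently used page
        memory[page] = True
    return faulted

def program3_submatrix_refs(n, rz):
    """Program 3 order (inner loop across a row) with RZ x RZ submatrix pages."""
    for i in range(n):
        for j in range(n):
            for name in ("B", "C", "A"):
                yield (name, i // rz, j // rz)

RZ = 4
N = RZ * RZ                               # N = Z = RZ**2
trace = lru_fault_trace(program3_submatrix_refs(N, RZ), frames=3)
transfers = Counter(trace)                # how many times each page was brought in
print(len(trace), set(transfers.values()))
```

Under these assumptions the loop incurs 3 faults every RZ inner iterations, and every page of every array is brought into main memory exactly RZ times.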

In Chapter 3 the page indexing transformation will be introduced
to cure the problems of multi-dimensional arrays. This is designed
to transform a program such that all words of a page involved in
a calculation will be referenced while the page is in main memory. We
will adopt the submatrix storage scheme because of its inherent
advantages as presented in [McKE69].

(ii)   Mixing of arrays of different dimensions in a loop.

The performance of a loop can be affected in different ways
when arrays of different dimensions are referenced. Consider the
following example:

Program 4-a.   DO 3 J = 1,N
               DO 3 I = 1,N
               T(I,J) = .5 * DELT + TTA(I)
           3   CONTINUE

Since the elements of the two-dimensional array are referenced
in the order in which they are stored, column-wise, the two-dimensional
array presents no problem. This loop can be modeled by the ELM
because equations 2.1 - 2.4 are satisfied. Namely, we have

$m_o$ = O(# different array names) = 2
MTBR = $O(R_\ell)$ = 2
MTBPF = $O(R_\ell * Z)$ = 2*Z
$T_c$ = $O(R_\ell * Z + m_o * T)$ = O(2*Z + 2*T)

Because of the existence of the one-dimensional array, $T_c$
is not fixed through the execution of the program. In the first
cycle two page faults will occur because t(1) and tta(1) must be
allocated. Thus $T_c$ = 2*Z + 2*T. In the following cycles, however,
only the t page will be replaced. Thus the steady state cycle time,
$T_{cs}$, is given by 2*Z + T. Theorem 1 does not hold for this loop because
the cycle time is not constant throughout the lifetime of the
program.

In other situations, Theorem 1 will not hold for different
reasons. For example, the following loop will not have identical
performance under LRU and MIN replacement algorithms.

Program 4-b.   DO 3 J = 1,N
               DO 3 I = 1,N
               T(I,J) = T(I,J) + .5*TTA(J)
           3   CONTINUE

The reference string generated during the two iterations
(J = j-1, I = N) and (J = j, I = 1) is the following:

..., t(j-1), tta(1), t(j-1), t(j), tta(1), t(j), ...

With 2 page frames under LRU, tta(1) will be replaced at the 4th
reference to allocate t(j). MIN will replace t(j-1). Thus under LRU,
$T_{cs}$ = 3*Z + 2*T while under MIN $T_{cs}$ = 3*Z + T. Note, however, that
this loop can still be modeled by the ELM because equations 2.1 to 2.4
are satisfied.
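The different steady state behavior of Programs 4-a and 4-b can be reproduced by counting faults with 2 page frames. The sketch below is illustrative: it assumes N = Z (so TTA occupies one page and each column of T occupies one page), and the function names are introduced here.

```python
def count_faults(refs, frames, policy):
    """Page faults under "LRU" or "MIN" (Belady) with `frames` page frames."""
    refs = list(refs)
    memory = []                          # resident pages, least recently used first
    faults = 0
    for t, page in enumerate(refs):
        if page in memory:
            memory.remove(page)          # refresh recency (harmless for MIN)
        else:
            faults += 1
            if len(memory) == frames:
                if policy == "MIN":
                    future = refs[t + 1:]
                    # evict the page referenced farthest in the future (or never)
                    victim = max(memory, key=lambda p: future.index(p)
                                 if p in future else len(future))
                else:
                    victim = memory[0]   # least recently used page
                memory.remove(victim)
        memory.append(page)
    return faults

def program_4a(n):                       # T(I,J) = .5*DELT + TTA(I)
    for j in range(n):
        for i in range(n):
            yield "tta"
            yield ("t", j)

def program_4b(n):                       # T(I,J) = T(I,J) + .5*TTA(J)
    for j in range(n):
        for i in range(n):
            yield ("t", j)
            yield "tta"
            yield ("t", j)

N = 6                                    # hypothetical size, with N = Z
results_4a = {p: count_faults(program_4a(N), 2, p) for p in ("LRU", "MIN")}
results_4b = {p: count_faults(program_4b(N), 2, p) for p in ("LRU", "MIN")}
print(results_4a, results_4b)
```

For Program 4-a both policies give N + 1 faults. For Program 4-b LRU gives two faults per cycle (2*N in total) while MIN gives one fault per cycle after the first (N + 1 in total), in agreement with the two steady state cycle times above.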

In the previous two examples, mixing arrays of different
dimensions in a loop did not present severe problems. Both loops could
be modeled by the ELM although Theorem 1 does not hold for them.
Their behavior is asymptotic to the behavior of elementary loops.

(iii)   Loops with assignment statements at different nest levels.

Consider the following program:

Program 5.     DO 3 J = 1,N
               PT(J) = TTA(J)
               DO 3 I = 1,N
               T(I,J) = .5 * DELT + TTA(J)
           3   CONTINUE

With 3 page frames, this loop will have $T_c$ = (2*Z + 2) + T,
which is $O(R_\ell * Z + m_o * T)$. There is, however, an obvious waste in the
space-time resource. The PT page is referenced only once during a cycle
time. In other words, the N references made to PT are uniformly
distributed through the execution time of the loop. This is reflected by
the MTBR to the PT page which is O(N) instead of $O(R_\ell)$. Hence this
loop cannot be modeled by the ELM. The loop distribution transformation
presented in Chapter 3 will cure this problem.
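The large MTBR of the PT page can be measured directly on the reference string of Program 5. The sketch below assumes N = Z, so that PT and TTA each occupy one page and each column of T occupies one page, and it assumes that each assignment reads its right-hand side before writing its left-hand side; the function names are hypothetical.

```python
from collections import defaultdict

def program5_refs(n):
    """Page references of Program 5 in execution order."""
    for j in range(n):
        yield "tta"                  # PT(J) = TTA(J): read TTA ...
        yield "pt"                   # ... then write PT
        for i in range(n):
            yield "tta"              # T(I,J) = .5*DELT + TTA(J): read TTA ...
            yield ("t", j)           # ... then write column J of T

def mean_time_between_refs(refs):
    """Mean number of references between successive references to each page."""
    last_seen, gaps = {}, defaultdict(list)
    for t, page in enumerate(refs):
        if page in last_seen:
            gaps[page].append(t - last_seen[page])
        last_seen[page] = t
    return {page: sum(g) / len(g) for page, g in gaps.items()}

N = 8                                # hypothetical size, with N = Z
mtbr = mean_time_between_refs(program5_refs(N))
print(mtbr["pt"], mtbr["tta"], mtbr[("t", 0)])
```

The PT page is referenced once every 2*Z + 2 references, while the TTA page and the current T page are referenced every 2 references.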

As another example of a loop with large MTBR, consider the
following program:

Program 6.     DO 10 I = 1,N
               A1 = W(I,1) * X(I,1)
               DO 10 J = 1,N
               A2 = WW * G(J,I)
               Y(J,I) = Y(J,I) + (A1-A2)/DZ
               A1 = A2
          10   CONTINUE

Here, with 4 page frames, $T_c$ will be (3*Z + 2 + 2*T). The MTBR for
the W and X pages is O(N) and not $O(R_\ell)$. Hence the ELM will not hold.
A combination of the scalar expansion technique and loop distribution
will handle the situation of this loop. This will also be discussed
in Chapter 3.

(iv)   IF statements in loops.

IF  statements  in  loops  will  control  the  order  of  execution  of 
assignment  statements.   Moreover  they  control  which  statements  are  to 
be executed during every iteration of the loop. Thus the memory
requirement might in general vary between two cycles or even within one cycle.
Moreover,  the  cycle  time  might  vary  from  one  cycle  to  another.   Thus 
static  measurements  might  not  reflect  an  accurate  estimate  of  the 
parameters  of  the  ELM  for  a  loop  that  contains  an  IF  statement. 

IF statements can be classified into several types [TOWL76]. One
type of IF, called type A, can be easily removed from the scope of
the loop. The condition tested by a type-A IF is independent of the
loop index and all variables computed within the loop. The result of the
test will be the same for all iterations of the loop. This type of IF
is illustrated in the following loop:

Program 7-a.   DO 10 I = 1,N
               IF (S.EQ.0) GO TO 3
               A(I) = 4*B(I)*C(I) - D(I)**2
               GO TO 10
           3   A(I) = 0
          10   CONTINUE

The IF here is a static switch which can be removed as follows:

Program 7-b.   IF (S.EQ.0) GO TO 3
               DO 101 I = 1,N
         101   A(I) = 4*B(I)*C(I) - D(I)**2
               GO TO 103
           3   DO 102 I = 1,N
         102   A(I) = 0
         103   CONTINUE

One  of  the  resulting  two  loops  will  be  executed  depending  on  the  value 
of  S.   Each  of  the  loops  can  be  modeled  by  the  ELM.   It  is  important 
to  note  that  we  are  not  using  the  ELM  to  predict  which  parts  of  the 
program  will  be  executed  and  which  will  not.   What  we  are  trying  to  do 
is  to  transform  programs  such  that  whatever  loops  are  executed  will  be 
loops  which  can  be  modeled  using  the  ELM. 
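The removal of a type-A IF can be mirrored in a short sketch that checks the two forms compute the same result. This is a Python rendering of Programs 7-a and 7-b written for illustration; the function names and the sample data are assumptions.

```python
def program_7a(s, b, c, d):
    """Loop with the loop-invariant test evaluated on every iteration."""
    a = [None] * len(b)
    for i in range(len(b)):
        if s == 0:
            a[i] = 0
        else:
            a[i] = 4 * b[i] * c[i] - d[i] ** 2
    return a

def program_7b(s, b, c, d):
    """Test hoisted out of the loop: exactly one IF-free loop is executed."""
    n = len(b)
    if s == 0:
        return [0 for _ in range(n)]
    return [4 * b[i] * c[i] - d[i] ** 2 for i in range(n)]

b, c, d = [1, 2, 3], [4, 5, 6], [7, 8, 9]      # hypothetical data
same_when_zero = program_7a(0, b, c, d) == program_7b(0, b, c, d)
same_when_nonzero = program_7a(1, b, c, d) == program_7b(1, b, c, d)
print(same_when_zero, same_when_nonzero)
```

Each of the two loops in the hoisted form references a fixed set of arrays on every iteration, which is what allows it to be modeled by the ELM.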

In  the  other  types  of  IFs,  the  condition  tested  will  be  a 
function  of  the  index  of  the  loop  or  some  variables  computed  in  the  loop. 
These  types  of  IFs  cannot  be  removed  outside  the  scope  of  the  loop  in 
the  simple  manner  illustrated  in  Program  7.   In  many  situations,  however, 
the  IFs  do  not  affect  all  the  statements  within  the  loop.   This  is 
illustrated  in  the  following  two  examples. 

Program 8-a.   DO S4 I = 1,N
          S1   B(I) = G(I) - 7*DELT
          S2   IF (B(I) .GT. 0) C(I) = C(I)/B(I)*D(I)
          S3   C(I) = C(I) + 5
          S4   E(I) = D(I) * E(I)

Program 8-b.   TEMP = 0
               DO S3 I = 1,N
          S1   A(I) = B(I) * C(I) + 3
          S2   IF (TEMP .GT. A(I)) F(I) = 1
          S3   TEMP = TEMP + X(I) * Y(I)

In Program 8-a only S2 is affected by the IF and in Program
8-b S1 is not affected by the IF. The loop distribution transformation
will  transform  loops  such  that  either  the  resulting  loops  are  free  of 
IF  statements  or  all  the  assignment  statements  within  the  loop  are 
affected  by  the  IF  statements  such  that  they  must  be  left  in  the  same 
loop.   In  real  programs,  the  number  of  statements  and  arrays  in  the 
latter  type  of  loops  is  small  and  hence  the  variations  in  the  parameters 
of  the  ELM  for  these  loops  are  small. 

2.3   Summary

In  this  chapter  previous  stochastic  and  deterministic  models  of 
program  behavior  were  discussed.   The  difficulty  of  developing  a  simple 
accurate  model  of  program  behavior  is  due  to  the  fact  that  programs  as 
written  by  people  are  not  well  behaved  from  a  paging  system  point  of  view. 

The  concept  of  the  elementary  loop  model,  ELM,  was  developed 
and  the  parameters  of  this  model  were  discussed.   Examples  of  programs 
which  do  not  follow  this  model  were  presented.   In  the  next  chapter 
compiler  transformations  will  be  designed  to  cure  such  problems  as  those 
illustrated by the examples. Other transformations will aim at
improving the ELM parameters of a given program. Thus, after applying
the transformations of Chapter 3 to programs, they will be simple to
model and cheap to run in a virtual memory computer.

3.   PROGRAM TRANSFORMATIONS

A large portion of the early work in program analysis and
transformations was motivated by the development of high speed parallel and
vector  machines  like  the  ILLIAC  IV,  CDC  STAR,  and  TI  ASC  around  the  turn 
of  the  decade.   For  these  supercomputers,  and  the  more  recent  ones, 
the  Cray-1  and  Burroughs  Scientific  Processor,  the  need  for  a  vectorizing 
compiler  is  definite.   The  enormous  computational  power  of  these  machines 
cannot  be  widely  utilized  by  the  general  scientific  community  of  users 
unless  people  can  use  ordinary  high  level  languages  to  write  programs 
for  these  machines.   Moreover,  there  is  an  obvious  need  to  be  able  to 
run  the  large  amount  of  existing  software,  which  was  originally  written 
for  serial  machines,  on  the  new  machines. 

For  the  last  few  years  research  has  been  conducted  at  the 
University of Illinois to solve these problems. The problem of
transforming ordinary serial programs to run on parallel and vector machines
has been investigated and the results have been very good. A large
software package called the PARAFRASE compiler evolved with the progress of
these investigations. The PARAFRASE compiler takes an ordinary serial
Fortran program and uses different compiler transformations to expose
the inherent parallelism of the program [LEAS76], [WOLF78].
Pseudo-code is generated and used to find the resulting speedup if the program
were executed on parallel machines compared to serial machines.

The theme of this thesis is the enhancement of the performance of
virtual memory computers. In this chapter we present program
transformations to achieve this goal. These are intended to be optimizing
compiler  transformations  which  are  tailored  to  cure  the  problems  of 
large  programs  in  virtual  memory  computers.   Each  transformation  will 
serve  one  or  both  of  two  purposes.   The  first  aim  is  to  make  programs 
follow  the  ELM  and  the  second  is  to  improve  the  parameters  of  this  model 
for  a  given  program.   A  transformation  aimed  at  the  first  goal  is  a 
fix-up  transformation.   A  transformation  aimed  at  the  second  goal  is  an 
enhancement  transformation. 

Several  of  the  concepts  and  transformations  developed  for  speeding 
the  execution  of  programs  on  parallel  machines  will  be  useful  to  us 
either with or without modifications. Thus we will use some of the
transformations implemented in the PARAFRASE compiler, modify some, and
introduce some new ones. We will think of the transformations and present
them as source-to-source transformations. Our description of the
transformations which were developed originally for parallel program execution
will be very brief. We will present the modified and the new
transformations in more detail.

The  flow  chart  shown  in  Figure  8  gives  an  overview  of  the  general 
transformation  process.  This  flow  chart  is  intended  to  help  the  reader 
of this chapter understand the relationship between the different
transformations and their relative order. Going back to examining this flow
chart while reading this chapter will clarify the purpose and the logic
behind the different transformations.

Figure 8.   An Outline of the Transformation Process

In the preliminary transformations stage we apply (without
modification) the following set of transformations which are currently
implemented in PARAFRASE [WOLF78]:
(i)   DO Loop Normalization,
(ii)   IF Pattern Matching,
(iii)   Scalar Renaming,
(iv)   Induction Variable Substitution and Subscript Cleaning,
(v)   Type-A IF Removal from DO Loops.

These transformations are aimed at breaking data dependences and
simplifying the control structure of the program. We will not discuss
these transformations any further and refer the interested reader to [WOLF78].

Basic to the analysis of programs and development of transformations
is the concept of data dependence. A brief discussion of this concept
and related definitions will be presented in Section 3.1 and is
based on [KUCK78], [TOWL76], and [BANE76].

In Sections 3.2 through 3.5 we discuss the rest of the transformations.
In general we will present in each section some necessary definitions,
some examples to illustrate the usefulness of the particular
transformation, the transformation algorithm, and if needed some tests to
check for the correctness of the transformation. We will try to strike
a balance between formal and informal definitions of the transformations.
A very formal definition leads to complex notations which explain
unimportant details. Although we will present the transformations as
separate entities, the intent is that all those relevant to a program
segment will be applied.




3.1   Data Dependence Analysis

The set of input variables of an assignment statement S, IN(S), is
the collection of variables appearing to the right of the assignment symbol.
The output variable of S, OUT(S), is the variable which is assigned a value
as a result of executing statement S. The output variable appears to the
left of the assignment symbol. When S is executed each member of IN(S)
is fetched from memory at least once and the output variable is stored in
memory. Outside loops, an assignment statement $S_q$ is said to be data
dependent on another assignment statement $S_p$ if
$IN(S_q) \cap OUT(S_p) = x \neq \emptyset$ and the
value computed in $S_p$ for x is used in $IN(S_q)$. We denote this by $S_p \Rightarrow S_q$.
If we have $x = OUT(S_q)$, $x \in IN(S_p)$, and the value of x computed in $S_q$ is
not used in $S_p$ then $S_q$ is data antidependent on $S_p$. Antidependence is
denoted by $S_p \nRightarrow S_q$. $S_q$ is said to be data output dependent on $S_p$,
$S_p \overset{\circ}{\Rightarrow} S_q$, if $x = OUT(S_p) = OUT(S_q)$ and the value calculated in $S_q$ is stored in
x after the value which is calculated in $S_p$. If $x = OUT(S_p)$ is a scalar
variable then testing for dependence between $S_p$ and statement $S_q$ is simple
and only involves name searching for x in $IN(S_q)$ and $OUT(S_q)$ and finding
the order of execution of $S_p$ relative to $S_q$. If x is an array element
then the value of its subscripts in $S_p$ and $S_q$ should be identical in
order for a dependence to exist.
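For scalar variables outside loops, the name-searching test can be sketched in a few lines. The sketch below is a simplification made for illustration: each statement is represented only by its output variable and its set of input variables, the first statement is assumed to execute before the second, and no intervening statement is assumed to redefine the variable in question. The function name `classify` is hypothetical.

```python
def classify(first, second):
    """Dependence relations of `second` on `first`, where `first` executes earlier.

    Each statement is a pair (output_variable, set_of_input_variables).
    """
    out_p, in_p = first
    out_q, in_q = second
    relations = set()
    if out_p in in_q:
        relations.add("data dependence")      # second reads what first wrote
    if out_q in in_p:
        relations.add("antidependence")       # second overwrites what first read
    if out_p == out_q:
        relations.add("output dependence")    # both store into the same variable
    return relations

s1 = ("A", {"B", "C"})        # A = B + C
s2 = ("D", {"A"})             # D = A * 2
s3 = ("B", {"E"})             # B = E
s4 = ("A", {"F"})             # A = F
print(classify(s1, s2), classify(s1, s3), classify(s1, s4))
```

For array elements the same test applies only when the subscript values in the two statements coincide.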

The definition of dependence relations can be extended to cover
statements in loops. Let us use $S_p(i_1, i_2, \ldots, i_d)$ to denote the instance
of statement $S_p$ during the particular iteration when $I_1 = i_1, I_2 = i_2,
\ldots, I_d = i_d$, where $I_1, I_2, \ldots, I_d$ are the index variables of the loop. Let
$x = OUT(S_p(k_1, k_2, \ldots, k_d))$. If we use the notation $S_p \prec S_q$ to denote
that $S_p$ is executed before $S_q$ then we have [KUCK78]:

1)  $S_p \Rightarrow S_q$ if $x \in IN(S_q(\ell_1, \ell_2, \ldots, \ell_d))$ and
    $S_p(k_1, k_2, \ldots, k_d) \prec S_q(\ell_1, \ell_2, \ldots, \ell_d)$

2)  $S_q \nRightarrow S_p$ if $x \in IN(S_q(\ell_1, \ell_2, \ldots, \ell_d))$ and
    $S_q(\ell_1, \ell_2, \ldots, \ell_d) \prec S_p(k_1, k_2, \ldots, k_d)$

3)  $S_p \overset{\circ}{\Rightarrow} S_q$ if $x = OUT(S_q(\ell_1, \ell_2, \ldots, \ell_d))$ and
    $S_p(k_1, k_2, \ldots, k_d) \prec S_q(\ell_1, \ell_2, \ldots, \ell_d)$

Testing for dependence between statements within loops can be done
by unrolling the loop and listing each statement for each iteration of the
loop. Each statement can then be checked against the statements that follow
it for data dependence as described earlier. This testing procedure is
lengthy and expensive. Tests for data dependence can instead be performed
without actually unrolling the loop. For array variables this involves
testing the subscript expressions for the set of values which the index
variable can take. In [BANE76] necessary and sufficient conditions for
dependence are derived for index expressions that are linear functions of
one index variable. For the rare case when the subscript expression is more
complex or the subscripts are array elements, data dependence is usually
assumed.
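
The unrolling test can be illustrated with a small Python sketch. It is
only an illustration under stated assumptions: the encoding (`Ref`, `Stmt`)
and the function `unrolled_dependences` are hypothetical names introduced
here, subscripts are restricted to the form I + constant in a single loop,
and the test is address based, so it reports potential dependences without
checking whether an intervening store overwrites the value. The sample data
encode the loop of Figure 9.

```python
from collections import namedtuple

# Hypothetical encoding: Ref("A", 1) stands for the array element A(I+1).
Ref = namedtuple("Ref", "name offset")
# One assignment statement: label, output reference, input references.
Stmt = namedtuple("Stmt", "label out ins")


def unrolled_dependences(stmts, lo, hi):
    """Unroll I = lo..hi and compare every instance with all later ones."""
    # Execution order: iteration by iteration, statements in textual order.
    instances = [(s, i) for i in range(lo, hi + 1) for s in stmts]
    deps = set()
    for pos, (sa, ia) in enumerate(instances):
        out_a = (sa.out.name, ia + sa.out.offset)
        ins_a = {(r.name, ia + r.offset) for r in sa.ins}
        for sb, ib in instances[pos + 1:]:
            out_b = (sb.out.name, ib + sb.out.offset)
            ins_b = {(r.name, ib + r.offset) for r in sb.ins}
            if out_a in ins_b:      # earlier store, later fetch
                deps.add((sa.label, sb.label, "dependence"))
            if out_b in ins_a:      # earlier fetch, later store
                deps.add((sa.label, sb.label, "antidependence"))
            if out_a == out_b:      # two stores to the same element
                deps.add((sa.label, sb.label, "output dependence"))
    return deps


# The loop of Figure 9, I = 2, 3.
figure9 = [
    Stmt("S1", Ref("A", 0), [Ref("B", -1), Ref("C", 0)]),
    Stmt("S2", Ref("C", 0), [Ref("A", 1)]),
    Stmt("S3", Ref("B", 0), [Ref("C", 0), Ref("A", 0), Ref("B", 0)]),
]
figure9_deps = unrolled_dependences(figure9, 2, 3)
```

The cost grows with the square of the number of unrolled statement
instances, which is why the subscript tests mentioned above are preferred
in practice.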

To simplify the testing procedures in [BANE76] it is assumed that
the subscript expressions are functions only of the index variables.
Moreover, the increment of an index variable between one iteration and the
next is assumed to be 1. In [WOLF78] several transformations are described
to ensure that index variables and subscript expressions satisfy these
conditions.

In the previous discussion it was implicitly assumed that the loops
are IF-free. In [TOWL76] procedures for removing IF statements from the
scope of loops are described. Some types of IFs cannot be removed and in
such situations it is currently assumed in the PARAFRASE compiler that all
statements in the loop are interdependent. Research to improve the
treatment of IFs is still going on.

The data dependence relations between statements in a block of
assignment statements or a loop can be represented by a data dependence
graph G. Each assignment statement S is represented by a node in the
graph. If $S_i \Rightarrow S_j$ we draw a directed arc of the type -> from the
node representing $S_i$ to the $S_j$ node. An arc of the form -o-> is drawn
from $S_i$ to $S_j$ if $S_i \overset{\circ}{\Rightarrow} S_j$, and an arc of the type
-/-> is drawn from $S_i$ to $S_j$ if $S_i \nRightarrow S_j$. Figure 9 shows a loop
and its data dependence graph. Note the cycle in the graph. In general a
cycle can exist in a graph if there are two statements $S_p$ and $S_q$ such
that the relations $S_p \,\Delta\, S_q$ and $S_q \,\Delta\, S_p$ are both true. The
relation $S_x \,\Delta\, S_y$ is defined by:

$S_x$ (dependence operator) $S_{i_1}$ (dependence operator) ... (dependence
operator) $S_{i_n}$ (dependence operator) $S_y$, $n \geq 0$.

The dependence operator can be any of $\Rightarrow$, $\nRightarrow$, or
$\overset{\circ}{\Rightarrow}$. The $\Delta$ relation can be used to partition the
nodes of a data dependence graph into a set of node partitions. Two nodes
representing statements $S_k$ and $S_\ell$ are in the same node partition,
called a π-block, if and only if $S_k \,\Delta\, S_\ell$ and $S_\ell \,\Delta\, S_k$.
In other words all the nodes which are in a cycle of the graph belong to
the same π-block. A node which is not in a cycle is a π-block by itself.
Later in this chapter an algorithm will be presented to distribute the
loop control on its π-blocks.
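
A minimal Python sketch of this partitioning is given below. It assumes
that the three arc types can be treated alike, so that a π-block is simply
a set of mutually reachable nodes; the function name `pi_blocks` and the
four-statement example graph are hypothetical and are not taken from any
figure in the text.

```python
def pi_blocks(nodes, arcs):
    """Group nodes so that two nodes share a block iff each reaches the other."""
    succ = {n: set() for n in nodes}
    for src, dst in arcs:
        succ[src].add(dst)

    def reachable(start):
        # All nodes reachable from start along one or more arcs.
        seen, stack = set(), [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {n: reachable(n) for n in nodes}
    blocks, placed = [], set()
    for n in nodes:
        if n in placed:
            continue
        block = [m for m in nodes
                 if m == n or (m in reach[n] and n in reach[m])]
        blocks.append(block)
        placed.update(block)
    return blocks


# Hypothetical graph: S2 and S3 lie on a cycle, S1 and S4 do not.
example_blocks = pi_blocks(
    ["S1", "S2", "S3", "S4"],
    [("S1", "S2"), ("S2", "S3"), ("S3", "S2"), ("S3", "S4")],
)
```

For this graph the sketch yields three blocks: S1 alone, S2 together with
S3, and S4 alone. The reachability computation is quadratic in the worst
case; a linear-time strongly connected components algorithm would serve
the same purpose on large graphs.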

      DO 1 I = 2, 3
      A(I) = B(I-1)*3 + C(I)
      C(I) = A(I+1)*3
      B(I) = C(I)*A(I) + B(I)
1     CONTINUE

S11   A(2) = B(1)*3 + C(2)
S21   C(2) = A(3)*3
S31   B(2) = C(2)*A(2) + B(2)
S12   A(3) = B(2)*3 + C(3)
S22   C(3) = A(4)*3
S32   B(3) = C(3)*A(3) + B(3)

Figure 9. A Loop, Its Unrolled Version, and Its Data Dependence Graph

3.2 Clustering of Assignment Statements Algorithm

Programmers tend to group in the same loop different assignment
statements which perform similar operations on different sets of arrays.
Very obvious examples of such loops are initialization loops where
different arrays of similar dimensions are initialized. This situation can
also occur in loops where much more sophisticated calculations are
performed. Examples of these loops are those performing similar
calculations on real and imaginary parts of complex arrays.

The clustering transformation is designed to separate the set of
statements inside a loop into several subsets such that in each subset a
different group of arrays will be referenced. Each subset thus formed is
called a name partition (NP). The transformation is applied to the loops
of the program one at a time. The aim is to reduce the memory
requirements of the program.

3.2.1  Definitions  and  Notations 

Before describing the algorithm we make some definitions. For a
particular loop L, let

$S_L = (S_1, S_2, \ldots, S_i, \ldots, S_n)$

be the ordered set of assignment statements controlled by L. For statement
$S_i$, $i$ is the ordering number. The set

$A(L) = (a_1, a_2, \ldots, a_i, \ldots, a_m)$

is the set of arrays referenced in L. If $\sigma$ is a subset of $S_L$, then the
set of arrays referenced in $\sigma$ is denoted by $A(\sigma)$.

Definition 3.1 The name partitions of the loop L are a set
$(NP_1, NP_2, \ldots, NP_k)$ of subsets of $S_L$ with the following properties:

(i)    $S_L = \bigcup_{q=1}^{k} NP_q$

(ii)   $NP_i \cap NP_j = \emptyset$ for all $1 \leq i, j \leq k$, $i \neq j$

(iii)  $A(NP_i) \cap A(NP_j) = \emptyset$ for all $1 \leq i, j \leq k$, $i \neq j$

(iv)   $A(L) = \bigcup_{q=1}^{k} A(NP_q)$

(v)    If $S_i \in NP_q$ and $S_j \in NP_\ell$ then there is no data dependence or
       data antidependence between $S_i$ and $S_j$ due to scalar variables.

3.2.2   The  Clustering  Algorithm 

It is obvious from Definition 3.1 that the control of a loop L
can be distributed over its NP's. The order of execution of the resulting
loops will be arbitrary. The NP's of a loop can be found by constructing
an undirected clustering graph according to the following algorithm:

(i)    Corresponding to each assignment statement draw a node and
       label it with the label of the statement.

(ii)   For each array $a_i$ referenced in the loop make a list, $L_{a_i}$, of
       the statements in which $a_i$ is referenced.

(iii)  Take every list formed in (ii) and travel through the nodes
       representing the statements in the list. When moving from one
       node to the next draw an undirected arc if no such arc existed
       because of a previous list.

(iv)   Draw an arc, if one was not already drawn, between the nodes
       of any two assignment statements if there is a data dependence
       or antidependence between the two statements due to a scalar
       variable.

(v)    Divide the nodes of the graph into clusters. Each cluster will
       represent one NP and will contain the maximum number of connected
       nodes. Thus every pair of nodes in a cluster will be connected
       either directly or through other nodes which belong to the same
       cluster.

Figure 10 shows an example of applying the clustering algorithm
to a loop. We note that the worst case complexity of the clustering
algorithm is O(number of statements of the loop * number of variables
referenced in the loop).
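
Steps (i) through (v) amount to finding the connected components of the
clustering graph. The Python sketch below is one way to do this, using a
union-find structure instead of an explicit graph; that choice, the
function name `name_partitions`, and the dictionary encoding of the
statements are assumptions made for illustration. The data encode the loop
of Figure 10, with the scalar arcs for QVT1 and QCT1 listed by hand.

```python
def name_partitions(array_refs, scalar_arcs=()):
    """array_refs maps each statement label to the arrays it references."""
    parent = {s: s for s in array_refs}      # step (i): one node per statement

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]    # path halving
            s = parent[s]
        return s

    def union(a, b):
        parent[find(b)] = find(a)

    # Step (ii): one list of statements per array.
    lists = {}
    for stmt, arrays in array_refs.items():
        for name in arrays:
            lists.setdefault(name, []).append(stmt)
    # Step (iii): connect consecutive statements on every list.
    for stmts in lists.values():
        for a, b in zip(stmts, stmts[1:]):
            union(a, b)
    # Step (iv): arcs due to scalar dependences and antidependences.
    for a, b in scalar_arcs:
        union(a, b)
    # Step (v): each connected cluster is one name partition.
    clusters = {}
    for stmt in array_refs:
        clusters.setdefault(find(stmt), []).append(stmt)
    return list(clusters.values())


# The loop of Figure 10.
figure10_refs = {
    "S1": {"QV", "QV1"}, "S2": {"QC", "QC1"},
    "S3": {"QV", "QV1"}, "S4": {"QC", "QC1"},
    "S5": {"QV1"}, "S6": {"QC1"},
    "S7": {"QV"}, "S8": {"QC"}, "S9": {"QV"},
    "S10": {"QV"}, "S11": {"QC"}, "S12": {"QC"},
}
figure10_nps = name_partitions(
    figure10_refs, scalar_arcs=[("S1", "S5"), ("S2", "S6")]
)
```

The two clusters returned correspond to NP1 and NP2 of Figure 10.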

We now elaborate on the usefulness of the clustering transformation
in reducing the cost of execution of multi-NP loops. If the original
loop was assigned a number of page frames equal to its critical memory
allotment, then one needs to assign to the transformed program only the
maximum of the critical memory allotments of the resulting NP's. With
this memory allotment the amount of I/O transfers will be the same for
the original and transformed programs. Thus the space-time cost of the
program will also be reduced by almost the same amount as its space
requirement. This is true because the increase in the CPU time due to the
additional control statements of the transformed program is not
significant. One can establish a bound on the reduction of the space and
the space-time cost. This is expressed in the following theorem:

Theorem 3.1 The upper bound on the improvement in the space requirement
and the space-time cost of a loop due to the clustering transformation is
a factor of K, where K is the number of name partitions generated by the
clustering algorithm.

      DO 20 J = 1, NY1
      DO 10 I = 1, NX
S1    QVT1 = QV(I, J) + TS*QV1(I, J)
S2    QCT1 = QC(I, J) + TS*QC1(I, J)
S3    QV(I, J) = QV1(I, J)
S4    QC(I, J) = QC1(I, J)
S5    QV1(I, J) = QVT1
S6    QC1(I, J) = QCT1
10    CONTINUE
S7    QV(NX, J) = QV(1, J)
S8    QC(NX, J) = QC(1, J)
S9    QV(NXP, J) = QV(2, J)
S10   QV(NX+2, J) = QV(3, J)
S11   QC(NXP, J) = QC(2, J)
S12   QC(NX+2, J) = QC(3, J)
20    CONTINUE

$L_{QV}$  = (S1, S3, S7, S9, S10)
$L_{QV1}$ = (S1, S3, S5)
$L_{QC}$  = (S2, S4, S8, S11, S12)
$L_{QC1}$ = (S2, S4, S6)

NP1 = (S1, S3, S5, S7, S9, S10)
NP2 = (S2, S4, S6, S8, S11, S12)

Figure 10. A Loop and Its Clustering Graph.

      DO 201 J = 1, NY1
      DO 101 I = 1, NX
S1    QVT1 = QV(I, J) + TS*QV1(I, J)
S3    QV(I, J) = QV1(I, J)
S5    QV1(I, J) = QVT1
101   CONTINUE
S7    QV(NX, J) = QV(1, J)
S9    QV(NXP, J) = QV(2, J)
S10   QV(NX+2, J) = QV(3, J)
201   CONTINUE
      DO 202 J = 1, NY1
      DO 102 I = 1, NX
S2    QCT1 = QC(I, J) + TS*QC1(I, J)
S4    QC(I, J) = QC1(I, J)
S6    QC1(I, J) = QCT1
102   CONTINUE
S8    QC(NX, J) = QC(1, J)
S11   QC(NXP, J) = QC(2, J)
S12   QC(NX+2, J) = QC(3, J)
202   CONTINUE

Figure 11. Distributing the Control of the Loop in Fig. 10 on Its NP's

Proof:

The critical memory requirement of the original program, $m_{oL}$,
is O(# of different array names in the loop). For the transformed program
the critical memory requirement, $m_o$, is determined by the maximum
of the number of array names in the different resulting NP's. If K is
the number of NP's, then the smallest value which $m_o$ can take is
$(m_{oL}/K)$. Since the clustering algorithm does not change the I/O time,
the space-time cost will also drop by a factor of K.

Figure 11 shows the loop of Figure 10 with the control distributed
on the NP's. Obviously the space and space-time cost are reduced by a
factor of 2.

3.3 Fusion of Name Partitions

3.3.1 The Usefulness of the Fusion Transformation

The aim of this transformation is to reduce I/O time without
increasing the memory requirements of a program. This is achieved by
combining in one name partition several name partitions from different
loops. The memory requirements of the combined name partition will not
exceed the maximum memory needs of the individual NP's. As an example
consider the following loops taken from a Fast Fourier Transform program:

Program 9-a.

      DO 6 K = K1, N2N, NDISP
      KPNG = K + NG
S1    CR(KPNG) = CR(K) - STOUTR(K)
S2    CI(KPNG) = CI(K) - STOUTI(K)
6     CONTINUE
      DO 8 K = K1, N2N, NDISP
S3    CR(K) = CR(K) + STOUTR(K)
S4    CI(K) = CI(K) + STOUTI(K)
8     CONTINUE

Using the clustering algorithm we get two NP's from the first loop:
$NP_{11} = \{S_1\}$ and $NP_{12} = \{S_2\}$. We also get two NP's from the second
loop: $NP_{21} = \{S_3\}$ and $NP_{22} = \{S_4\}$. If we distribute the loop control
on the NP's we get the following program:

Program 9-b.

      DO 61 K = K1, N2N, NDISP
61    CR(K + NG) = CR(K) - STOUTR(K)
      DO 62 K = K1, N2N, NDISP
62    CI(K + NG) = CI(K) - STOUTI(K)
      DO 81 K = K1, N2N, NDISP
81    CR(K) = CR(K) + STOUTR(K)
      DO 82 K = K1, N2N, NDISP
82    CI(K) = CI(K) + STOUTI(K)

The critical memory allotment of the first loop in the original
program is four page frames and the total number of page faults is 4*K,
where K is the number of pages spanned by each array. The second loop has
similar memory requirements and number of page faults. Thus the original
program can execute in four page frames, the total number of page faults
is 8*K, and the total space-time cost is 32*K. After applying the
clustering transformation, Program 9-b needs only two page frames to
execute without changing the number of page faults. Thus with
clustering we have achieved an improvement of a factor of two in the
memory requirement and space-time cost, without increasing the I/O time.

If we examine the arrays being referenced in the NP's, we find that
$A(NP_{11}) = A(NP_{21}) = (CR, STOUTR)$ and $A(NP_{12}) = A(NP_{22}) = (CI, STOUTI)$.
Moreover, the loop structure of $NP_{11}$ and $NP_{21}$ is identical; the nesting
levels, the starting values of the index variables, the increment values,
and the upper bounds of the index sets are all identical. We also notice
that there are no data dependences between $NP_{12}$ and $NP_{21}$. Thus it is
valid to combine $NP_{11}$ and $NP_{21}$ in one name partition, $NP_1$. By similar
arguments we can combine $NP_{12}$ and $NP_{22}$ in a single name partition,
$NP_2$. Hence after NP fusion the program will be transformed to the
following:

Program 9-c.

      DO 1 K = K1, N2N, NDISP
      CR(K + NG) = CR(K) - STOUTR(K)
1     CR(K) = CR(K) + STOUTR(K)
      DO 2 K = K1, N2N, NDISP
      CI(K + NG) = CI(K) - STOUTI(K)
2     CI(K) = CI(K) + STOUTI(K)

The memory requirement of Program 9-c is the same as that of Program
9-b, namely, two page frames. Program 9-c, however, will produce
fewer page faults: a total of 4*K page faults compared to 8*K page faults
for the clustered and the original program. Table 4 compares the memory,
I/O, and space-time cost of Programs 9-a, 9-b, and 9-c. We note that by
using NP fusion of the clustered program we have improved the memory
requirement, I/O, and space-time cost of the original program by factors of
2, 2, and 4 respectively.

Table 4. Resource Requirements of Programs 9-a, 9-b, 9-c

                         Critical Memory   Total Number of   Space-Time
                         Allotment         Page Faults       Cost

Original Program               4                8*K             32*K

Clustered Program              2                8*K             16*K

Fusion Applied to the
Clustered Program              2                4*K              8*K
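
The figures quoted for the three programs are consistent with taking the
space-time cost as the product of the critical memory allotment and the
number of page faults. The short Python sketch below recomputes them on
that assumption, which is inferred from the quoted numbers and is not a
definition stated in the text; page faults are expressed in units of K,
and the names used are hypothetical.

```python
# (page frames, page faults in units of K) as quoted for each program.
programs = {
    "9-a original": (4, 8),
    "9-b clustered": (2, 8),
    "9-c fused": (2, 4),
}


def space_time(frames, faults):
    """Assumed cost model: memory held multiplied by the number of faults."""
    return frames * faults


costs = {name: space_time(*p) for name, p in programs.items()}

# Improvement of the fused program over the original one.
base_frames, base_faults = programs["9-a original"]
fused_frames, fused_faults = programs["9-c fused"]
improvement = (
    base_frames // fused_frames,                         # memory
    base_faults // fused_faults,                         # I/O
    costs["9-a original"] // costs["9-c fused"],         # space-time
)
```

Under this model the costs come out as 32*K, 16*K, and 8*K, and the
improvement factors as 2, 2, and 4.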

3.3.2 Notation and Definitions

After illustrating the usefulness of the fusion transformation,
we discuss some definitions relevant to the general fusion algorithm.
The program is divided into a set of basic blocks. A basic
block is defined as a section of code with only one point of entry and
one point of exit. It contains a sequence of loops and possibly groups
of assignment statements outside the loops. The fusion algorithm is
applied to one basic block at a time. This is preceded by applying the
clustering algorithm to the loops of the basic block. Let the number of
loops in a basic block be $m$, $m \geq 1$. For loop $L_k$, $1 \leq k \leq m$, the
clustering algorithm finds its set of $n_k$ name partitions, $n_k \geq 1$. These
are denoted by $NP_{k1}, NP_{k2}, \ldots, NP_{kn_k}$. The set of arrays referenced
in $NP_{ki}$ is denoted by $A(NP_{ki})$.

Although the NP's of one loop are by definition data independent,
dependence relations can exist between NP's from different loops of a
basic block of code. A name partition $NP_{ki}$ is data dependent on another
name partition $NP_{qj}$ ($k \neq q$) if and only if there exists at least one
statement in $NP_{ki}$ which is data dependent on a statement in $NP_{qj}$. We
denote this by $NP_{qj} \Rightarrow NP_{ki}$. Similarly a data antidependence and
data output dependence can exist between the name partitions $NP_{ki}$ and
$NP_{qj}$ if and only if there exists at least one statement in $NP_{ki}$ which
is data antidependent or data output dependent on a statement in $NP_{qj}$.
If $NP_{ki}$ is data antidependent on $NP_{qj}$ then this is denoted by
$NP_{ki} \nRightarrow NP_{qj}$. $NP_{qj} \overset{\circ}{\Rightarrow} NP_{ki}$ means that
$NP_{ki}$ is data output dependent on $NP_{qj}$.

3.3.3 Correctness of Fusing Two Name Partitions

Before presenting the fusion procedure, let us discuss the question
of the correctness of fusing two NP's, $NP_{ki}$ and $NP_{qj}$ ($k < q$). When
we fuse the two NP's we add the set of statements of $NP_{qj}$ to those of
$NP_{ki}$, i.e., $NP_{ki} = NP_{ki} \cup NP_{qj}$. The fusion of the two NP's will be
valid if the following conditions are satisfied.

(A)   The control structure of $NP_{ki}$ and $NP_{qj}$ is identical. This means
that the index variable sets and the nesting structure for the two NP's
are identical.

(B)   If $(NP_\ell, NP_{\ell+1}, \ldots, NP_{\ell+g})$ is the set of NP's between
$NP_{ki}$ and $NP_{qj}$ then there is no data dependence, antidependence, or
output dependence between $NP_{qj}$ and any NP in this set. Moreover, there
is no dependence between any assignment statement in $NP_{qj}$ and any
assignment statement which occurs outside NP's and between $NP_{ki}$ and
$NP_{qj}$.

We now present the general fusion algorithm. Again, we have $m$
loops in a basic block of code $(L_1, L_2, \ldots, L_m)$. Each loop $L_k$ has
$n_k$ name partitions, $(NP_{k1}, NP_{k2}, \ldots, NP_{kn_k})$.

3.3.4  The  Fusion  Algorithm 

(i)    Set $k = 1$, $\ell = 1$, $i = 2$.

(ii)   Compare $A(NP_{k\ell})$ with $A(NP_{i1})$. If $A(NP_{k\ell}) \subseteq A(NP_{i1})$
       or $A(NP_{i1}) \subseteq A(NP_{k\ell})$ then test for the correctness of
       fusing $NP_{k\ell}$ and $NP_{i1}$. If the fusion can be done, replace
       $A(NP_{k\ell})$ by $A(NP_{k\ell}) \cup A(NP_{i1})$ and $NP_{k\ell}$ by
       $NP_{k\ell} \cup NP_{i1}$. Decrement $n_i$ and eliminate $NP_{i1}$ from
       the set of NP's of loop $i$. If $n_i = 0$ then decrement $m$.

(iii)  Repeat step (ii) by considering $NP_{k\ell}$ and $NP_{ij}$, $j = 2, \ldots, n_i$.

(iv)   If $i = m$ go to step (v) else increment $i$ and go to step (ii).

(v)    If $\ell = n_k$ go to step (vi) else increment $\ell$ and go to step (ii).

(vi)   If $k = m$ exit, else increment $k$ and go to step (ii).

We note that the complexity of this algorithm is O((total number
of NP's in the basic code block)$^2$).
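
The overall shape of the algorithm can be sketched in Python as follows.
This is a simplified illustration, not the procedure above step for step:
the name partitions of the basic block are kept in one list in program
order, the control structure is compared as an opaque value, and condition
(B) is approximated conservatively by requiring that no intervening NP
shares an array with the NP being moved. The function
`fuse_name_partitions` and the dictionary keys are hypothetical names. The
data describe Program 9-b.

```python
def fuse_name_partitions(nps):
    """Greedy pairwise fusion over the NP's of one basic block."""
    nps = [dict(p, arrays=set(p["arrays"]), stmts=list(p["stmts"]))
           for p in nps]
    a = 0
    while a < len(nps):
        b = a + 1
        while b < len(nps):
            first, second = nps[a], nps[b]
            # Array sets must be comparable by inclusion.
            comparable = (first["arrays"] <= second["arrays"]
                          or second["arrays"] <= first["arrays"])
            # Conservative stand-in for condition (B): the NP being moved
            # shares no array with any NP lying between the two.
            independent = all(not (mid["arrays"] & second["arrays"])
                              for mid in nps[a + 1:b])
            if (first["loop"] != second["loop"] and comparable
                    and first["control"] == second["control"]
                    and independent):
                first["arrays"] |= second["arrays"]
                first["stmts"] += second["stmts"]
                del nps[b]          # the fused NP leaves its own loop
            else:
                b += 1
        a += 1
    return nps


control = ("K", "K1", "N2N", "NDISP")   # index, start, bound, increment
program_9b = [
    {"loop": 1, "control": control, "arrays": {"CR", "STOUTR"}, "stmts": ["S1"]},
    {"loop": 1, "control": control, "arrays": {"CI", "STOUTI"}, "stmts": ["S2"]},
    {"loop": 2, "control": control, "arrays": {"CR", "STOUTR"}, "stmts": ["S3"]},
    {"loop": 2, "control": control, "arrays": {"CI", "STOUTI"}, "stmts": ["S4"]},
]
fused = fuse_name_partitions(program_9b)
```

On this input the sketch produces two partitions, one holding S1 and S3
and one holding S2 and S4, which matches Program 9-c. Every pair of NP's
is examined at most once, in line with the quadratic bound noted above.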

3.4  Scalar  Transformations 

Programmers usually introduce assignment statements with scalar
output variables inside loops for different reasons. A scalar variable
can be used as a temporary to hold the value of an expression which is
common to several assignment statements. Sometimes the right-hand side
expression of an assignment statement is very long and programmers prefer
to divide the expression into parts to improve the readability of the
program. Every part is assigned to a scalar variable and these are used
in the right-hand side expression of the assignment statement. Another
possibility is that the assignment statement to the scalar variable is a
recurrence.

3.4.1  The  PARAFRASE  Compiler  Scalar  Transformations 

As will be shown in the next section, distributing the loop control
of an NP on its π-blocks can be used to reduce the amount of memory
required to execute the NP. In the PARAFRASE compiler several techniques
are used to remove the arcs in the data dependence graph of a loop which
are due to assignment statements to scalar variables. This will simplify
the graph and reduce the number of statements included in a π-block.
This is useful to us because the smaller the number of statements in the
π-blocks of an NP, the smaller the amount of memory which is needed for
its execution. Of the techniques used in the PARAFRASE compiler to break
data dependences due to scalars we use (without modification) scalar
renaming, induction variable substitution, and forward substitution of
right-hand sides of assignment statements to scalars which are used in
subscript expressions. The dead code elimination pass will eliminate the
assignment statements to those scalars treated by these techniques.

In the PARAFRASE compiler all scalars which cannot be handled by
the previous three techniques will be expanded into array variables.
Figure 12-a shows an example program and its data dependence graph. Notice
the cycle in the graph. In Figure 12-b the scalar has been expanded into
an array of size N and thus the cycle in the dependence graph has
disappeared. The distribution algorithm which will be presented in the next
section can be used to distribute the loop of the program in Figure 12-b.
The program in 12-a is undistributable.

      DO 10 I = 1, N
S1    T = A(I) - E(I)
S2    A(I) = B(I)*C(I)
S3    B(I) = T + F(I)/D(I)
10    CONTINUE

Figure 12-a. A Loop Including an Assignment Statement to a Scalar
Variable and Its Data Dependence Graph.

      DO 10 I = 1, N
S1    T'(I) = A(I) - E(I)
S2    A(I) = B(I)*C(I)
S3    B(I) = T'(I) + F(I)/D(I)
10    CONTINUE

Figure 12-b. The Loop of Figure 12-a After Expanding the Scalar and
Its New Data Dependence Graph.

3.4.2 The Scalar Forward Substitution Transformation

Figure 13-a shows another example program and its data dependence
graph. The cycles in the graph are again due to an assignment statement
to the scalar variable T. One can still use the scalar expansion technique
to simplify the data dependence graph of the program in Figure 13-a. In
fact, this is what is done in PARAFRASE. However, for this example the
right-hand side of $S_1$ can be forward substituted in $S_2$ and $S_3$. $S_1$
can then be eliminated from the loop. This is shown in Figure 13-b. In
PARAFRASE, this technique is not used because redundant computation might
be introduced. This is the case in the loop of Figure 13-a because the
scalar T is used in the right-hand side of two statements. Since PARAFRASE
was written to speed up program execution, forward substitution would not
be a suitable transformation.

When people are compiling for parallel or pipelined machines they
are not worried too much about the increase of the memory requirements of
a transformed program if it can be executed on a parallel machine much
faster than the original program on a serial machine. In this thesis we
are concerned with compiler transformations for serial virtual memory
computers. We are interested in a modified version of the PARAFRASE loop
distribution transformation. In the next section we will describe our
distribution transformation, the vertical distribution algorithm. Hence
we are also interested in techniques to break data dependences in a loop
which are introduced by assignment statements to scalar variables.
However, we are concerned here with the memory requirement of the program
and its I/O activity.

      DO 10 I = 1, N
S1    T = A(I)*C(I)
S2    D(I) = D(I)**2 - T**.5
S3    F(I) = T*(A(I) - C(I)) + F(I)/C(I)
10    CONTINUE

Figure 13-a. Another Loop with an Assignment Statement to a Scalar
Variable and Its Data Dependence Graph.

      DO 10 I = 1, N
S2    D(I) = D(I)**2 - (A(I)*C(I))**.5
S3    F(I) = (A(I)*C(I))*(A(I) - C(I)) + F(I)/C(I)
10    CONTINUE

Figure 13-b. The Use of Forward Substitution to Simplify the Data
Dependence Graph of Fig. 13-a.

Our approach will be to use the forward substitution technique in
some situations and a modified version of the scalar expansion technique
in other situations. Shortly we will give some rules to be used in
deciding what to do for every specific case. Before presenting these rules
we make one observation and then explain our modification to the scalar
expansion transformation.

3.4.2.1   Correctness  of  the  Forward  Substitution  Transformation 

We note that the scalar expansion transformation can be applied
to any scalar output variable of any assignment statement. This
transformation is always correct as long as all references to the expanded
scalar are replaced by references to appropriate elements of the resulting
array. The details of the scalar expansion algorithm can be found in
[WOLF78]. The forward substitution transformation, on the other hand,
cannot be used in all cases. For example it cannot be applied to the
program in Figure 12-a. To address the correctness of the forward
substitution transformation we make the following definitions.

Definition 3.2 If the output variable of an assignment statement is a
scalar variable $x$, then this statement is called the source statement of
the scalar $x$, $S_x$. The set of arrays referenced in $S_x$ is denoted by
$AS_x$.

Definition 3.3 A destination statement of a scalar variable $x$, $D_x$, is an
assignment statement which is data dependent on $S_x$. In other words
$S_x \Rightarrow D_x$. We denote the set of arrays referenced in $D_x$ by $AD_x$.

The necessary and sufficient condition for the correctness of
the forward substitution transformation can now be stated as follows: If
the source statement of a scalar variable $x$, $S_x$, is not a recurrence then
its right-hand side expression can be forward substituted in a destination
statement of $x$, $D_x$, if and only if there is no statement executed
after $S_x$ and before $D_x$ which is antidependent on $S_x$. If this condition
is satisfied, then none of the input variables of $S_x$ changes its
value before the execution of $D_x$ and the substitution will be valid.
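
The condition can be checked mechanically for statements within one
iteration of a loop. The following Python sketch is a simplified
illustration: variables are compared by their textual form (so A(I) only
matches A(I)), dependences carried between iterations are ignored, and the
function name `can_forward_substitute` and the tuple encoding are
hypothetical. The two data sets encode the loops of Figures 12-a and 13-a.

```python
def can_forward_substitute(stmts, src_label, dst_label):
    """stmts: list of (label, output, inputs) in execution order."""
    labels = [label for label, _, _ in stmts]
    s, d = labels.index(src_label), labels.index(dst_label)
    _, src_out, src_ins = stmts[s]
    if src_out in src_ins:
        return False            # the source statement is a recurrence
    # No statement between source and destination may store into a
    # variable that the source statement fetches (an antidependence).
    return all(out not in src_ins for _, out, _ in stmts[s + 1:d])


figure_12a = [
    ("S1", "T", {"A(I)", "E(I)"}),
    ("S2", "A(I)", {"B(I)", "C(I)"}),
    ("S3", "B(I)", {"T", "F(I)", "D(I)"}),
]
figure_13a = [
    ("S1", "T", {"A(I)", "C(I)"}),
    ("S2", "D(I)", {"D(I)", "T"}),
    ("S3", "F(I)", {"T", "A(I)", "C(I)", "F(I)"}),
]

ok_12a = can_forward_substitute(figure_12a, "S1", "S3")
ok_13a = (can_forward_substitute(figure_13a, "S1", "S2")
          and can_forward_substitute(figure_13a, "S1", "S3"))
```

For Figure 12-a the check fails because S2 stores into A(I), which S1
fetches; for Figure 13-a it succeeds for both destination statements.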

3.4.3  Modifying  the  Scalar  Expansion  Transformation 

When a scalar variable is expanded into an array in the PARAFRASE
compiler a different element of the array is used for every iteration of
the loop. Thus, for example, in Figure 12-b the expansion array, T', will
be of size N. For execution on a parallel machine the loop can be
distributed as shown in Figure 14-a. The distributed loop can be executed
in 4 time steps on a parallel machine with N processors (we assume that
performing any arithmetic operation takes one time step). On a serial
machine, the program of Figure 12-a takes 4*N time steps to be executed.
Thus a speedup of a factor of N has been achieved by distributing the loop.

In Section 3.5.1 we will show that although this kind of
distribution, which we will call horizontal distribution, can result in
reducing $m_o$ of a loop, it may increase the I/O activity and possibly the
space-time cost of the execution. In the same section we will modify the
distribution algorithm to avoid any increase in the I/O activity and to
ensure a reduction in the space-time cost. We will call our modified
distribution algorithm vertical distribution.

      DO S1 I = 1, N
S1    T'(I) = A(I) - E(I)
      DO S2 I = 1, N
S2    A(I) = B(I)*C(I)
      DO S3 I = 1, N
S3    B(I) = T'(I) + F(I)/D(I)

Figure 14-a. The Loop of Figure 12-b After Applying the Horizontal
Distribution Transformation.

      DO 10 IP = 1, ⌈N/Z⌉
      ILB = 1 + (IP-1)*Z
      IUB = MIN(IP*Z, N)
      DO S1 I = ILB, IUB
S1    T'(I MOD(Z) + 1) = A(I) - E(I)
      DO S2 I = ILB, IUB
S2    A(I) = B(I)*C(I)
      DO S3 I = ILB, IUB
S3    B(I) = T'(I MOD(Z) + 1) + F(I)/D(I)
10    CONTINUE

Figure 14-b. The Vertically Distributed Version of the Loop in Figure
12-b.

Figure 14-b shows the vertically distributed version of Program
12-a. By using vertical distribution the size of the expanded scalar need
only be Z words, one page size. The expression (I MOD(Z)+1) is used as
a subscript expression for the expanded array. Thus, with 4 page frames,
the execution of the program in Figure 14-b starts by using the first page
of A and the first page of E to compute one page of T' in S1. In S2 the
first page of B and the first page of C are used to modify the first page
of A. In S3 the page of T' is used with the first page of F and the first
page of D to write into the first page of B. In the second iteration of the
outermost loop, the IP loop, the second pages of the arrays A, B, C, D, E,
and F will be processed. However, the same page of T' can again be utilized
to hold the Z temporary values computed in S1 to be used in S3. This will
be true for all iterations of the IP loop.
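
The reuse of a single page of T' can be checked with a small Python sketch. It is only an illustration under stated assumptions: the scalar form of the loop (T = A(I) - E(I), then the updates of A and B) is inferred from Figure 14, the function names are hypothetical, and Python lists are 0-based, so T'(I MOD(Z) + 1) becomes t_page[i % z].

```python
import math

def scalar_loop(a, b, c, d, e, f):
    """Assumed original loop: T is a scalar temporary reused every iteration."""
    a, b = list(a), list(b)
    for i in range(len(a)):
        t = a[i] - e[i]
        a[i] = b[i] * c[i]
        b[i] = t + f[i] / d[i]
    return a, b

def vertical_loop(a, b, c, d, e, f, z):
    """Vertically distributed form in the style of Figure 14-b.

    The expansion array holds only z words (one page) and is reused
    in every iteration of the IP loop.
    """
    a, b = list(a), list(b)
    n = len(a)
    t_page = [0.0] * z
    for ip in range(1, math.ceil(n / z) + 1):
        ilb = 1 + (ip - 1) * z
        iub = min(ip * z, n)
        for i in range(ilb, iub + 1):          # S1 loop
            t_page[i % z] = a[i - 1] - e[i - 1]
        for i in range(ilb, iub + 1):          # S2 loop
            a[i - 1] = b[i - 1] * c[i - 1]
        for i in range(ilb, iub + 1):          # S3 loop
            b[i - 1] = t_page[i % z] + f[i - 1] / d[i - 1]
    return a, b
```

Within one IP cycle the index I takes at most Z consecutive values, so I MOD(Z) never repeats and no temporary is overwritten before S3 reads it.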

Thus  the  difference  between  expanding  scalars  in  parallel  machine 
transformations  and  in  virtual  memory  computer  transformations  is  that  the 
size  of  the  expansion  array  in  the  latter  case  is  less  than  or  equal  to  one 
page  size. 

As  mentioned  before,  the  details  of  the  expansion  algorithm  are 
found  in  [WOLF78].   We  use  the  same  algorithm  except  for  reducing  the  size 
of  the  expansion  array.   In  Figure  15,  we  show  another  example  program  and 
its  vertically  distributed  version.   Note  that  we  have  expanded  the  output 
scalar variable of statement S1 into an appropriate one page array.

3.4.4  Choosing  Between  Scalar  Expansion  and  Forward  Substitution 

      DO 10 I = 1, N
      DO 10 J = 1, N
S1    T = A(I, J)
S2    A(I, J) = A(J, I) + B(I, J)
S3    A(J, I) = T + C(I, J)
10    CONTINUE

Figure 15-a. An Example Loop.

      DO 10 IP = 1, ⌈N/RZ⌉
      ILB = 1 + (IP-1)*RZ
      IUB = IP*RZ
      DO 10 JP = 1, ⌈N/RZ⌉
      JLB = 1 + (JP-1)*RZ
      JUB = JP*RZ
      DO S1 I = ILB, IUB
      DO S1 J = JLB, JUB
S1    T'(I MOD(RZ) + 1, J MOD(RZ) + 1) = A(I, J)
      DO S2 I = ILB, IUB
      DO S2 J = JLB, JUB
S2    A(I, J) = A(J, I) + B(I, J)
      DO S3 I = ILB, IUB
      DO S3 J = JLB, JUB
S3    A(J, I) = T'(I MOD(RZ) + 1, J MOD(RZ) + 1) + C(I, J)
10    CONTINUE

Figure 15-b. The Vertically Distributed Version of the Loop in
Figure 15-a.

When the control of an NP is vertically distributed on its
π-blocks, the critical memory allotment, m_0, for each π-block will be
roughly equal to the number of arrays referenced in it. Expanding the
scalar output variable of an assignment statement into an array will
increase m_0 of the π-block containing this assignment statement by one
page frame. Moreover, all references to the scalar variable in other
π-blocks must be replaced by references to the appropriate elements of the
expansion array. Thus, the m_0's for these π-blocks will also be increased.
For example in Figure 14-b scalar expansion has increased the number of
arrays referenced in both statements S1 and S3. However, by vertical
distribution, which is possible in Figure 14-b because of scalar expansion,
m_0 is equal to 4 instead of 6 for the original loop in Figure 12-a.

If forward substitution is possible and if AS_x ⊆ AD_x then
substituting the right-hand side expression of S_x in D_x will not increase
m_0 of D_x. If this can be done for all the destination statements of x,
S_x can be eliminated. Otherwise x must be expanded into an array and
references to x in those statements for which AS_x ⊈ AD_x must be replaced
by references to appropriate elements of the expansion array.

From the previous discussion we conclude that scalar expansion
should not be done unless it is incorrect to use the forward substitution
transformation or if AS_x ⊈ AD_x for some of the destination statements of
x.

3.5   Distribution  of  Name  Partitions 

By applying the clustering and the fusion transformations to a
program we expect to reduce its I/O activity, space requirement, and
space-time cost. At this point of the transformation process, the different
NP's of a basic block of code in the program will reference different
sets of arrays. In a particular NP, however, it is not necessary that all
the arrays of the NP will be referenced in each of its statements or even
by any single statement. Thus it is intuitive that by distributing the
control of an NP on its π-blocks its space requirements can be reduced to
be roughly equal to the maximum number of arrays referenced in any of its
π-blocks instead of the total number of arrays referenced in the NP.

In Section 3.5.1 we will present the distribution algorithm as
currently implemented in the PARAFRASE compiler. In the same section we
will differentiate between basic and nonbasic π-blocks. As mentioned
previously, although this kind of distribution, the horizontal
distribution, reduces m_0 of an NP, it might increase its I/O activity and
space-time cost. We will discuss an example to illustrate this point.

For NP's with basic π-blocks we describe the vertical distribution
algorithm in Section 3.5.2. This is the horizontal distribution algorithm
modified by the page indexing transformation. Vertical distribution
reduces m_0 of an NP but does not increase its I/O activity. In Section
3.5.2.1 we describe the algorithm when used for elementary loops. In
Section 3.5.2.2 we discuss the algorithm when applied to multinested loops
in which multi-dimensional arrays are referenced. In the same section we
illustrate the use of the page indexing transformation in matching the
pattern of reference of multi-dimensional arrays to their storage scheme.
The general vertical distribution algorithm is presented in Section 3.5.2.3.
Some implementation issues will also be considered in the same section.
In Section 3.5.2.4 we present two theorems to be used in testing the
correctness of applying the page indexing transformation.

Transforming NP's with nonbasic π-blocks is discussed in Section
3.5.3.

3.5.1 Horizontal Distribution of Name Partitions

We apply the horizontal distribution algorithm [KUCK78] to all
NP's in which the set of arrays of the NP are not all referenced in every
statement of the NP. If none of the arrays referenced in the NP is a
multi-page array, horizontal distribution will be the last transformation
applied to the NP. Otherwise, the method of distributing the control of
the NP on its π-blocks will be modified using the page indexing
transformation as will be described in the next section.

3.5.1.1  The  Horizontal  Distribution  Algorithm 

(i)    By analyzing the subscript expressions and the index set for
       each index variable of the NP, construct its data dependence
       graph.
(ii)   Identify the π-blocks of the NP as defined in Section 3.1. We
       define the following partial ordering relations between two
       π-blocks, π_i and π_j:

(a)  tt.  >  tt  .  if  and  only  if  there  exists  S.  ett.  and  Sn  £tt  . 

l    J  k   l      £   j 

such  that  S,  =>  s  • 

k     £ 

(b)  tt  .  £  tt  .  if  and  only  if  there  exists  S,  ett.  and  So  ett. 

i         j  ki  *       J 

such   that   S,    =#>   Sn . 

k  £ 

(c)  tt  .    >   tt.    if   and  only   if    there  exists   S.     ett.    and   S  „   ett. 

j    i  '  k   l      £   j 

such  that  S  =h   S,  . 

£      k 

We order the π-blocks of the NP according to these three relations.
Note that the resulting ordering is not unique.

(iii)  Distribute the NP control on its ordered π-blocks.

Figure 16 shows an NP, its data dependence graph, and its
horizontally distributed version.
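
Steps (i) to (iii) can be sketched in Python as follows. This is a hedged illustration, not the PARAFRASE implementation: it assumes that a π-block corresponds to a strongly connected component of the dependence graph (Section 3.1 is not reproduced here), it treats every kind of dependence as a plain arc, and the function name pi_blocks and the example graph are hypothetical.

```python
def pi_blocks(statements, arcs):
    """Group statements into blocks and order the blocks.

    statements: list of statement names.
    arcs: list of (src, dst) pairs, meaning dst depends on src.
    Returns a list of blocks (each a sorted list of statements) such that
    every block appears after the blocks it depends on.
    Uses Tarjan's strongly-connected-components algorithm.
    """
    succ = {s: [] for s in statements}
    for src, dst in arcs:
        succ[src].append(dst)

    index, low = {}, {}
    stack, on_stack, blocks = [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            block = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                block.append(w)
                if w == v:
                    break
            blocks.append(sorted(block))

    for s in statements:
        if s not in index:
            visit(s)

    # Tarjan emits components with no outgoing arcs first; reverse so that
    # each block comes after the blocks it depends on.
    blocks.reverse()
    return blocks
```

For a hypothetical graph with arcs S1→S2, S2→S3, S3→S2 and S1→S4, the blocks are {S1}, {S4}, and {S2, S3}. Placing {S4} after {S2, S3} would be equally valid, which matches the remark that the ordering is not unique.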

3.5.1.2   The  Problem  with  Horizontally  Distributing  an  NP  with  Multi-page 
Arrays 

If multi-page arrays are referenced in different π-blocks of an
NP, then the number of page transfers will be increased if the NP is
horizontally distributed. As an example consider the program in Figure 17-a.
The critical memory allotment for this NP, m_0, is equal to 3 and the total
number of page faults (using the LRU replacement algorithm) is 3*⌈N/Z⌉.
In the distributed NP of Figure 17-b, m_0 is reduced to 2 page frames.
However, the total number of page faults is increased to 6*⌈N/Z⌉. The
space-time cost is increased by a factor of (2*6*⌈N/Z⌉)/(3*3*⌈N/Z⌉) = 4/3.
In the undistributed NP, statements S1, S2, and S3 will issue all their
references to a particular page of the A array while this page is in main
memory. In the horizontally distributed version, statement S1 will issue
its references to an A page, then this page will be replaced. The same
page will be reloaded into main memory when it is referenced by S2 and
again when it is referenced by S3. Similarly a B page will be loaded
twice, once when it is referenced in S2 and again when it is referenced in
S3. Note that the distributed program would have no such problem if the
size of each array were one page or less.
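
The fault counts quoted above can be reproduced with a small LRU simulation in Python. The sketch rests on assumptions that the text does not state: each statement reads its right-hand-side operands from left to right and then writes its left-hand side, only array references cause page traffic, and element I of an array lies on page (I-1) div Z. All names are hypothetical.

```python
import math
from collections import OrderedDict

def lru_faults(trace, frames):
    """Count page faults of a reference trace under LRU with `frames` frames."""
    memory = OrderedDict()              # least recently used page first
    faults = 0
    for page in trace:
        if page in memory:
            memory.move_to_end(page)
        else:
            faults += 1
            if len(memory) == frames:
                memory.popitem(last=False)
            memory[page] = True
    return faults

# Arrays referenced by each statement of Figure 17-a, in the assumed order.
STATEMENTS = [
    "CAC",    # S1: C(I) = C(I) - A(I)
    "ABA",    # S2: A(I) = 4*A(I)*B(I) - 2
    "BABB",   # S3: B(I) = B(I)*A(I) + B(I)
]

def page(array, i, z):
    return (array, (i - 1) // z)

def original_trace(n, z):
    """Figure 17-a: all three statements inside one loop."""
    return [page(x, i, z)
            for i in range(1, n + 1) for s in STATEMENTS for x in s]

def horizontal_trace(n, z):
    """Figure 17-b: one complete loop per statement."""
    return [page(x, i, z)
            for s in STATEMENTS for i in range(1, n + 1) for x in s]
```

With N = 100 and Z = 8 there are ⌈N/Z⌉ = 13 pages per array. The simulation then gives 39 faults for the original loop with 3 frames and 78 faults for the horizontally distributed loops with 2 frames, so the space-time products are 117 and 156, a ratio of 4/3.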

(a) The NP

      DO S  J = 1, NY1
      DO S  I = 1, NX
      QVT1'(I, J) = QV(I, J) + TS*QV1(I, J)
      QV(I, J) = QV1(I, J)
      QV1(I, J) = QVT1'(I, J)
      QV(NX, J) = QV(1, J)
      QV(NXP, J) = QV(2, J)
      QV(NX+2, J) = QV(3, J)

(b) The Dependence Graph (graph not reproduced)

(c) The Distributed NP

      DO S  J = 1, NY1
      DO S  I = 1, NX
      QVT1'(I, J) = QV(I, J) + TS*QV1(I, J)
      DO S  J = 1, NY1
      DO S  I = 1, NX
      QV(I, J) = QV1(I, J)
      DO S5 J = 1, NY1
      DO S  I = 1, NX
      QV1(I, J) = QVT1'(I, J)
      DO S  J = 1, NY1
      QV(NX, J) = QV(1, J)
      DO S  J = 1, NY1
      QV(NXP, J) = QV(2, J)
      DO S10 J = 1, NY1
      QV(NX+2, J) = QV(3, J)

Figure 16. A Name Partition and Its Horizontally Distributed Version

      DO S3 I = 1, N
S1    C(I) = C(I) - A(I)
S2    A(I) = 4*A(I)*B(I) - 2
S3    B(I) = B(I)*A(I) + B(I)

Figure 17-a. A Loop Referencing Multi-page Arrays

      DO S1 I = 1, N
S1    C(I) = C(I) - A(I)
      DO S2 I = 1, N
S2    A(I) = 4*A(I)*B(I) - 2
      DO S3 I = 1, N
S3    B(I) = B(I)*A(I) + B(I)

Figure 17-b. Horizontally Distributing the Loop of Figure 17-a

Before curing the increased I/O problem by adding the page indexing
step to the horizontal distribution algorithm, let us differentiate
between basic and nonbasic π-blocks.

Definition 3.4 A basic π-block is a π-block in which all the statements
are at the same nest depth level. Some of the statements of a nonbasic
π-block will fall at different nest depth levels.

The vertical distribution transformation handles NP's which have
only basic π-blocks. Such NP's are called basic NP's.

3.5.2  Page  Indexing  and  Vertical  Distribution  of  Basic  Name  Partitions 

In  the  following  subsections  we  will  often  need  to  refer  to  a  set 
of  consecutive  integers.   We  now  define  a  function,  INT,  which  will  denote 
such  a  set.   We  also  give  a  formal  definition  of  a  basic  NP. 

Definition 3.5 Let w and k be two integers, w > 0. The function INT(w,k)
will denote the set of consecutive integers {(k-1)*w+1, (k-1)*w+2, ...,
(k-1)*w+w-1, k*w}.

Definition 3.6 A basic NP, BNP, is denoted by

BNP = (I_1 ← σ_1, I_2 ← σ_2, ..., I_d ← σ_d)(B_1, B_2, ..., B_n) where

I_j is an NP index, σ_j is an ordered index set, and B_j is a basic π-block
or another BNP.

In  some  cases  index  variables  of  an  NP  are  never  used  in  subscript 
expressions  of  arrays.   They  are  used  as  some  kind  of  counters.   We  wish 
to  differentiate  between  the  DO  statements  associated  with  such  index 
variables  and  those  associated  with  index  variables  used  in  subscript 
expressions. 

Definition 3.7 A type-A DO statement has an index variable which is
used in some subscript expressions in an NP. If the index variable of a
DO statement is never used in a subscript expression, then such a DO
statement is of type B. In Figure 18, DJ and DI are type-A DO statements.
DIJ is of type B.

3.5.2.1  Vertical  Distribution  of  Elementary  NP's 

By definition, an elementary NP has one DO statement and no
multi-dimensional arrays. The NP in Figure 17-a is an example of an
elementary NP. Let σ be the ordered index set. Let I_min be the smallest
integer in σ and I_max be its maximum integer. If I_min ∈ INT(Z, k_min) and
I_max ∈ INT(Z, k_max) then vertical distribution of the elementary NP means
executing its first π-block, π_1, for the ordered index set
σ ∩ INT(Z, k_min), then executing π_2 for the same set and so on until the
last π-block is executed for this set of values of the index variable. The
same process is repeated for the ordered index set σ ∩ INT(Z, k_min + 1).
We keep iterating until we execute all the π-blocks for the last subset of
the index variable set, namely σ ∩ INT(Z, k_max).

Figure 19 shows the vertically distributed version of the NP in
Figure 17-a. Vertical distribution is achieved by adding a set of
statements called the page indexing statement set. In Figure 19 these are
the ADD1I, ADD2I, and ADD3I statements. ADD1I is the paging DO statement.
Its scope includes all the π-blocks in the NP. Statement ADD2I defines
the lower bound of σ ∩ INT(Z, IP) for all the values of IP: 1, 2, ..., ⌈N/Z⌉.
Similarly ADD3I defines the upper bound of σ ∩ INT(Z, IP). We will refer to
statements ADD2I and ADD3I as the lower bound and upper bound definition
statements of the index variable I.
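
As a quick check of the bound definition statements, the following Python sketch (hypothetical helper names, with σ = {1, ..., N} assumed) computes ILB and IUB for each IP and compares the resulting range with σ ∩ INT(Z, IP) from Definition 3.5.

```python
import math

def INT(w, k):
    """The set {(k-1)*w+1, ..., k*w} of Definition 3.5."""
    return set(range((k - 1) * w + 1, k * w + 1))

def paging_bounds(n, z):
    """Yield (IP, ILB, IUB) as defined by the ADD1I, ADD2I, ADD3I statements."""
    for ip in range(1, math.ceil(n / z) + 1):
        ilb = 1 + (ip - 1) * z
        iub = min(ip * z, n)
        yield ip, ilb, iub

def matches_definition(n, z):
    """True if every [ILB, IUB] range equals sigma ∩ INT(Z, IP)."""
    sigma = set(range(1, n + 1))
    return all(set(range(ilb, iub + 1)) == sigma & INT(z, ip)
               for ip, ilb, iub in paging_bounds(n, z))
```

For N = 10 and Z = 4 the bounds are (1, 4), (5, 8), and (9, 10); the MIN in ADD3I is what trims the last, partial page.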

DIJ      DO 10 IJ = 1, 3
         PK(1) = 1. - G*DZ/(2.*PT(1)*QV0(1))
DJ       DO 10 J = 1, NY1
         PK(J) = PK(J-1)*CP*QVO(J)
DI       DO 10 I = 1, NX1
         QV(I, J) = HUM(J)*QVS
         QV1(I, J) = QV(I, J)
10       CONTINUE

Figure 18. A Loop with Type-A and Type-B DO Statements

ADD1I    DO 10 IP = 1, ⌈N/Z⌉
ADD2I    ILB = 1 + (IP-1)*Z
ADD3I    IUB = MIN(IP*Z, N)
         DO S1 I = ILB, IUB
S1       C(I) = C(I) - A(I)
         DO S2 I = ILB, IUB
S2       A(I) = 4*A(I)*B(I) - 2
         DO S3 I = ILB, IUB
S3       B(I) = B(I)*A(I) + B(I)
10       CONTINUE

Figure 19. The Vertically Distributed Version of the NP in Figure 17-a

Note that the vertically distributed program in Figure 19 does not
have the increased I/O problem of the horizontally distributed program in
Figure 17-b. With two page frames, two page faults will occur when
execution is started, to allocate two pages of the arrays A and C. This is
followed by a burst of CPU activity during which the S1 loop will be
executed for Z iterations. A page fault will occur when a B page replaces
the C page as the execution of the S2 loop is started. The execution
of this loop will last for 3*Z memory references. The same A and B pages
will be used when the S3 loop is executed for 4*Z references. Next the
value of IP is incremented and a new execution cycle is started. Thus
the number of page faults per cycle is 3 and the total number of page
faults is 3*⌈N/Z⌉. This is equal to the number of faults for the
undistributed program. Since m_0 was decreased from 3 to 2, the space-time
cost was also decreased by the same factor, namely, 3/2. Table 5 compares
the space, I/O time, and the space-time cost of the program in Figure 17-a,
its horizontally distributed version, and its vertically distributed
version.
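
The per-cycle count of three faults can be checked with a short Python simulation of the program in Figure 19. As before this is only a sketch under assumptions: operands are read left to right and then the left-hand side is written, element I lies on page (I-1) div Z, replacement is LRU, and all names are hypothetical.

```python
import math
from collections import OrderedDict

def lru_faults(trace, frames):
    """Count page faults of a reference trace under LRU with `frames` frames."""
    memory = OrderedDict()              # least recently used page first
    faults = 0
    for page in trace:
        if page in memory:
            memory.move_to_end(page)
        else:
            faults += 1
            if len(memory) == frames:
                memory.popitem(last=False)
            memory[page] = True
    return faults

# Arrays referenced by S1, S2, S3 of Figure 19, in the assumed order.
STATEMENTS = ["CAC", "ABA", "BABB"]

def vertical_trace(n, z):
    """Reference trace of Figure 19: for each page, run S1, S2, S3 loops."""
    trace = []
    for ip in range(1, math.ceil(n / z) + 1):
        ilb = 1 + (ip - 1) * z
        iub = min(ip * z, n)
        for s in STATEMENTS:
            for i in range(ilb, iub + 1):
                trace.extend((x, (i - 1) // z) for x in s)
    return trace

def space_time(frames, faults):
    """Space-time product in the sense of Table 5: frames times faults."""
    return frames * faults
```

For N = 100 and Z = 8 (13 pages per array) the simulation gives 39 faults with 2 frames, i.e. 3*⌈N/Z⌉, and a space-time product of 78 = 6*⌈N/Z⌉, in line with the last row of Table 5.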

3.5.2.2   Vertical  Distribution  of  Multi-nested  Basic  Name  Partitions  with 
Multi-dimensional  Arrays 

As was mentioned in Chapter 2 we will adopt the submatrix storage
scheme to store multi-dimensional arrays. We start this section by
illustrating the usefulness of the page indexing transformation in matching
the pattern of reference of multi-dimensional arrays to their storage scheme.
We then describe using the page indexing transformation to vertically
distribute multi-nested basic NP's which reference multi-dimensional arrays.

Table 5. Resource Requirement of the Program in Figure 17-a, Its
         Horizontally Distributed Version, and Vertically Distributed
         Version

                                Critical      Total          Space-
                                Memory        Number of      Time
                                Allotment     Page Faults    Cost

Original Program                3             3*⌈N/Z⌉        9*⌈N/Z⌉
After Horizontal Distribution   2             6*⌈N/Z⌉        12*⌈N/Z⌉
After Vertical Distribution     2             3*⌈N/Z⌉        6*⌈N/Z⌉

Consider the following matrix addition program

Program 10-a.

         DO 10 I = 1, N
         DO 10 J = 1, N
10       A(I,J) = B(J,I) + C(I,J)

Although the behavior of this program is improved by storing the arrays
using the submatrix scheme rather than row-wise or column-wise, the MTBPF
is still lower than predicted by the ELM. According to the ELM the MTBPF
is 3*Z. For Program 10-a the MTBPF is 3*RZ, where RZ = √Z. The page
indexing transformation will make the MTBPF equal to 3*Z. The transformed
program is shown below.

Program 10-b.

ADD1I    DO 10 IP = 1, ⌈N/RZ⌉
ADD2I    ILB = 1 + (IP-1)*RZ
ADD3I    IUB = MIN(IP*RZ,N)
ADD1J    DO 10 JP = 1, ⌈N/RZ⌉
ADD2J    JLB = 1 + (JP-1)*RZ
ADD3J    JUB = MIN(JP*RZ,N)
         DO 10 I = ILB, IUB
         DO 10 J = JLB, JUB
10       A(I,J) = B(J,I) + C(I,J)

Again the idea here is to change the indexing pattern such that
the maximum possible number of references are made to a page while the
page is in primary memory. For this program we have two index variables
I and J. The index variable set of I, σ_I = (1, 2, ..., N), is divided
into the subsets (σ_I ∩ INT(RZ,1)), (σ_I ∩ INT(RZ,2)), ...,
(σ_I ∩ INT(RZ,⌈N/RZ⌉)). Similarly σ_J is divided into similar subsets. The
assignment statement in Program 10-b will be first executed for the subsets
(I = σ_I ∩ INT(RZ,1)) × (J = σ_J ∩ INT(RZ,1)). Next the subsets
(I = σ_I ∩ INT(RZ,1)) × (J = σ_J ∩ INT(RZ,2)) will be used. The rest of the
sequence of the index subsets will be:
{(I = σ_I ∩ INT(RZ,1)) × (J = σ_J ∩ INT(RZ,3)); ...;
(I = σ_I ∩ INT(RZ,1)) × (J = σ_J ∩ INT(RZ,⌈N/RZ⌉));
(I = σ_I ∩ INT(RZ,2)) × (J = σ_J ∩ INT(RZ,1)); ...;
(I = σ_I ∩ INT(RZ,⌈N/RZ⌉)) × (J = σ_J ∩ INT(RZ,⌈N/RZ⌉))}.
When IP = ip_1 and JP = jp_1 the index variable subsets will be
(I = σ_I ∩ INT(RZ, ip_1)) and (J = σ_J ∩ INT(RZ, jp_1)). With this pattern
of indexing, page indexing, all the elements of an A, B, or C page will be
referenced before any elements of any other pages. Thus using page indexing
and with 3 page frames, the program will have the minimum number of page
faults, 3*⌈N/RZ⌉*⌈N/RZ⌉. The MTBPF will be 3*Z and MTBR will be 3 references.
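
The effect of page indexing on the matrix addition program can be illustrated with a Python simulation. The sketch assumes that under the submatrix storage scheme element (i, j) lies on page ((i-1) div RZ, (j-1) div RZ) of its array, that the statement A(I,J) = B(J,I) + C(I,J) references B, then C, then A, and that replacement is LRU with 3 frames. The function names are hypothetical.

```python
import math
from collections import OrderedDict

def lru_faults(trace, frames):
    """Count page faults of a reference trace under LRU with `frames` frames."""
    memory = OrderedDict()              # least recently used page first
    faults = 0
    for page in trace:
        if page in memory:
            memory.move_to_end(page)
        else:
            faults += 1
            if len(memory) == frames:
                memory.popitem(last=False)
            memory[page] = True
    return faults

def page(array, i, j, rz):
    """Page holding element (i, j) under the assumed submatrix storage."""
    return (array, (i - 1) // rz, (j - 1) // rz)

def refs(i, j, rz):
    """Pages touched by A(I,J) = B(J,I) + C(I,J), reads first, then the write."""
    return [page("B", j, i, rz), page("C", i, j, rz), page("A", i, j, rz)]

def row_order_trace(n, rz):
    """Indexing pattern of the untransformed double loop."""
    return [p for i in range(1, n + 1) for j in range(1, n + 1)
            for p in refs(i, j, rz)]

def page_indexed_trace(n, rz):
    """Indexing pattern after adding the paging statement sets for I and J."""
    trace = []
    blocks = math.ceil(n / rz)
    for ip in range(1, blocks + 1):
        ilb, iub = 1 + (ip - 1) * rz, min(ip * rz, n)
        for jp in range(1, blocks + 1):
            jlb, jub = 1 + (jp - 1) * rz, min(jp * rz, n)
            for i in range(ilb, iub + 1):
                for j in range(jlb, jub + 1):
                    trace.extend(refs(i, j, rz))
    return trace
```

For N = 16 and RZ = 4 the page-indexed order produces 3*⌈N/RZ⌉*⌈N/RZ⌉ = 48 faults, while the row order produces 192, because each row of I revisits every page along the J direction after it has been replaced.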

As is shown in Program 10-b page indexing was achieved by adding a
paging statement set for every type-A DO statement in the program. For
the DO statement of the I index variable, the statements ADD1I, ADD2I,
and ADD3I were added. The scope of the ADD1I paging DO statement is
identical to the scope of the I DO statement in the original program. The
ADD2I statement defines the lower bound of the subset (σ_I ∩ INT(RZ,IP))
and the ADD3I statement defines the upper bound of the same subset. Similar
remarks apply to the ADD1J, ADD2J, and ADD3J statements.

Vertical distribution of multi-nested basic NP's can be achieved
by adding a page indexing step to the horizontal distribution algorithm.
After the π-blocks of the NP are identified, each type-A DO statement
will be replaced by an appropriate paging statement set. Each π-block
which was in the scope of a replaced DO statement will be enclosed by a
new DO statement using the old index variable and the bounds of the index
set as defined in the added paging statement set. The control of all
type-B DO statements will be distributed on the relevant π-blocks without
any modification. We illustrate the vertical distribution procedure by
considering the following example:

Program 11-a.

DI       DO 10 I = 1, N
DJ       DO 10 J = 1, N
S1       C(I,J) = 0
DK       DO 10 K = 1, N
S2       C(I,J) = C(I,J) + A(I,K)*B(K,J)
10       CONTINUE

There are two π-blocks in this program, π_1 = {S1} and π_2 = {S2}.
There are three type-A DO statements DI, DJ, and DK. The scope of DI and
DJ includes both π_1 and π_2. The scope of DK includes only π_2. Thus the
scope of the paging DO loops in the vertically distributed version of the
program (ADD1I and ADD1J in the program below) will include π_1 and π_2. The
scope of the paging loop in the statement set replacing DK will include
only S2. The vertically distributed version of Program 11-a is as follows:

Program 11-b.

ADD1I    DO 10 IP = 1, ⌈N/RZ⌉
ADD2I    ILB = 1 + (IP-1)*RZ
ADD3I    IUB = MIN(IP*RZ,N)
ADD1J    DO 10 JP = 1, ⌈N/RZ⌉
ADD2J    JLB = 1 + (JP-1)*RZ
ADD3J    JUB = MIN(JP*RZ,N)
         DO S1 I = ILB, IUB
         DO S1 J = JLB, JUB
S1       C(I,J) = 0
ADD1K    DO 10 KP = 1, ⌈N/RZ⌉
ADD2K    KLB = 1 + (KP-1)*RZ
ADD3K    KUB = MIN(KP*RZ,N)
         DO S2 I = ILB, IUB
         DO S2 J = JLB, JUB
         DO S2 K = KLB, KUB
S2       C(I,J) = C(I,J) + A(I,K)*B(K,J)
10       CONTINUE


Note that in this program a page of the C array will be initialized
in π_1, then the same page will be referenced in π_2. Hence with vertical
distribution, a page which is referenced in several π-blocks will not
leave memory until it has been used in all these π-blocks.
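
A direct Python transcription of the two loop structures makes it easy to confirm that the page-indexed matrix multiplication computes the same product as the original triple loop. This is a sketch with hypothetical function names; matrices are lists of lists, and the 1-based FORTRAN subscripts are shifted by one.

```python
import math

def matmul_original(a, b):
    """Original loop nest: C(I,J) = 0, then accumulate over K."""
    n = len(a)
    c = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c[i][j] = 0                                   # S1
            for k in range(n):
                c[i][j] = c[i][j] + a[i][k] * b[k][j]     # S2
    return c

def page_blocks(n, rz):
    """(lower, upper) bounds, 1-based and inclusive, for each paging step."""
    return [(1 + (p - 1) * rz, min(p * rz, n))
            for p in range(1, math.ceil(n / rz) + 1)]

def matmul_page_indexed(a, b, rz):
    """Vertically distributed loop nest with paging loops IP, JP, KP."""
    n = len(a)
    c = [[None] * n for _ in range(n)]
    for ilb, iub in page_blocks(n, rz):
        for jlb, jub in page_blocks(n, rz):
            for i in range(ilb, iub + 1):                 # first block: S1
                for j in range(jlb, jub + 1):
                    c[i - 1][j - 1] = 0
            for klb, kub in page_blocks(n, rz):           # second block: S2
                for i in range(ilb, iub + 1):
                    for j in range(jlb, jub + 1):
                        for k in range(klb, kub + 1):
                            c[i - 1][j - 1] += a[i - 1][k - 1] * b[k - 1][j - 1]
    return c
```

Each element C(I,J) still accumulates its products in increasing order of K, so the result is identical to that of the original loop, while every page of C is initialized and then fully used before the JP loop moves on.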


3.5.2.3  Vertical  Distribution  of  Basic  NP's  -  the  General  Algorithm 
and  Some  Implementation  Considerations 
After  introducing  the  concept  of  vertical  distribution  by  examples 
in  the  previous  two  sections  we  now  present  the  general  algorithm. 

(i)    Construct the data dependence graph and identify the π-blocks
       of the NP as described in Section 3.5.1.1.
(ii)   Start with the outermost type-A DO statement. Replace it by an
       appropriate page indexing statement set. The scope of the
       paging loop is the same as the scope of the replaced DO
       statement.
(iii)  Enclose each π-block which was within the scope of the replaced
       DO statement by a DO statement using the same index variable.
       The upper and lower bounds of the index set are as defined in
       the added page indexing statement set. The increment is the
       same as in the replaced DO statement.
(iv)   Repeat (ii) and (iii) for the next outermost type-A DO
       statement. This process continues until all type-A DO statements
       have been replaced. The control of all type-B DO statements
       will be distributed on the relevant π-blocks as done in the
       horizontal distribution algorithm.

We note that the added complexity of the distribution algorithm
due to page indexing is O(# of DO statements in the NP).

In all the examples discussed in the previous sections all the
subscript expressions were linear functions of one index variable, i.e.,
of the form α*index variable + β. Moreover, for these examples the
coefficient of the index variable, α, was the same for all the subscript
expressions and it was equal to 1. β was equal to zero in all expressions.
If β ≠ 0 for some expressions, we will still use the same implementation
techniques as illustrated in the examples. If α ≠ 1 but it was the same
number, c, for all subscript expressions, our implementation method can
be modified slightly to accommodate such cases. This is illustrated in
the following example.
Program 12-a.

         DO 1 I = 1, N
S1       A(3I) = B(3I)*3
S2       D(3I) = B(3I-1)/3
1        CONTINUE

The  vertically  distributed  version  of  this  program  is  shown  below 
Program 12-b.

ADD1I    DO 1 IP = 1, ⌈N/⌊Z/3⌋⌉
ADD2I    ILB = 1 + (IP-1)*⌊Z/3⌋
ADD3I    IUB = MIN(IP*⌊Z/3⌋,N)
         DO S1 I = ILB, IUB
S1       A(3I) = B(3I)*3
         DO S2 I = ILB, IUB
S2       D(3I) = B(3I-1)/3
1        CONTINUE

We note that ⌊Z/3⌋ is the number of iterations which is spent by
Program 12-a referencing one page of A, one page of B, and one page of D.
Thus ⌈N/⌊Z/3⌋⌉ is the total number of pages of A referenced. Similarly the
same number of pages of the B and D arrays are referenced. Program 12-b
will have ⌈N/⌊Z/3⌋⌉ cycles. In each cycle 2*⌊Z/3⌋ references will be made
to two pages of A and B in the S1 loop. This is followed by 2*⌊Z/3⌋
references made to the same B page and a D page in the S2 loop.
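
The adjusted bounds can be computed as in the Python sketch below, which generalizes the paging statement set to a common subscript coefficient c by replacing Z with ⌊Z/c⌋. The helper name and its parameters are hypothetical, and the index set is assumed to start at 1.

```python
import math

def paging_bounds(n, z, c=1):
    """(ILB, IUB) pairs when every subscript has the form c*I + constant.

    Each cycle covers floor(z / c) iterations, so the subscripts touched
    in one cycle span at most z consecutive words of each array.
    """
    step = z // c
    cycles = math.ceil(n / step)
    return [(1 + (ip - 1) * step, min(ip * step, n))
            for ip in range(1, cycles + 1)]
```

For example, with N = 50, Z = 64, and c = 3 the step is ⌊64/3⌋ = 21 and there are ⌈50/21⌉ = 3 cycles with bounds (1, 21), (22, 42), and (43, 50).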

In general, if the coefficient of all the index variables in all
the subscript expressions was the number c, Z should be replaced by ⌊Z/c⌋
in the added statement set (or RZ should be replaced by ⌊RZ/c⌋ when
multi-dimensional arrays are involved). If the coefficients of the index
variables were not the same for all subscript expressions, we use their
minimum, c_min. Thus Z will be replaced by ⌊Z/c_min⌋ in the added statement
set. Cases where the subscript expressions are more complex functions
of one or more index variables are of little practical interest and hence
we will not discuss them any further.

Before leaving this section we remark that the lower bound of the
added paging DO statement was equal to 1 in all our examples. This is not
true in the general case. If the lower bound of the index set of the
replaced DO statement was I_min and I_min ∈ INT(Z, k_min) then the lower
bound of the paging index set will be k_min (in the case where
multi-dimensional arrays are involved, I_min ∈ INT(RZ, k_min)). Note that
the added lower bound definition statement should be adjusted to make
ILB = I_min when IP = k_min.

3.5.2.4   The  Correctness  of  the  Page  Indexing  Transformation 

The correctness of the horizontal distribution algorithm is obvious
from the definition of data dependences and π-blocks. When page indexing
is used to achieve vertical distribution, the order of referencing elements
of multi-dimensional arrays in π-blocks is different from the order of
their reference as specified in the undistributed program. Thus we need
to establish some necessary and sufficient conditions which can be used
to test whether the page indexing transformation is valid. We will
illustrate the problem by considering the following example.
Program 13-a.

         DO 1 I = 1, 48
         DO 1 J = 1, 48
S1       A(I,J) = B(I,J)*2
S2       C(I,J) = A(I-1, J+1)/2
1        CONTINUE

In this program there is one dependence relation, namely S2 is
data dependent on S1. Thus there will be no cycles in the data dependence
graph and the program can be horizontally distributed as shown below.

Program 13-b.

         DO S1 I = 1, 48
         DO S1 J = 1, 48
S1       A(I,J) = B(I,J)*2
         DO S2 I = 1, 48
         DO S2 J = 1, 48
S2       C(I,J) = A(I-1, J+1)/2

For  a  page  size  of  64  words  we  get  the  following  program  if  we 
apply  page  indexing  to  Program  13-b. 
Program 13-c.

ADD1I    DO 10 IP = 1, 6
ADD2I    ILB = 1 + (IP-1)*8
ADD3I    IUB = IP*8
ADD1J    DO 10 JP = 1, 6
ADD2J    JLB = 1 + (JP-1)*8
ADD3J    JUB = JP*8
         DO S1 I = ILB, IUB
         DO S1 J = JLB, JUB
S1       A(I,J) = B(I,J)*2
         DO S2 I = ILB, IUB
         DO S2 J = JLB, JUB
S2       C(I,J) = A(I-1, J+1)/2
10       CONTINUE

Program 13-c will produce erroneous results. To see this consider
for example the value assigned to C(2, 8) in S2. On the right-hand side
of S2 the value of A(1, 9) is used in computing C(2, 8). In Programs 13-a
and 13-b this value of A(1, 9) will be computed in S1. In Program 13-c
the value of A(1, 9) used to compute C(2, 8) is an old value, i.e., when
the assignment to C(2, 8) is made the new value computed in S1 for A(1, 9)
has not been stored in A(1, 9) yet. Hence Program 13-a cannot be vertically
distributed.
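
The failure can be reproduced with a short Python sketch of the three programs. The initial contents of A and the values of B are hypothetical choices made only so that old and new values of A are easy to tell apart; A is stored in a dictionary so that the out-of-range subscripts A(0, J+1) and A(I-1, 49) simply read their initial values.

```python
N, RZ = 48, 8

def old_a():
    """Hypothetical initial contents of A: distinct negative values."""
    return {(i, j): -(100 * i + j)
            for i in range(0, N + 1) for j in range(1, N + 2)}

def b(i, j):
    """Hypothetical contents of B."""
    return 100 * i + j

def program_13a():
    a, c = old_a(), {}
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            a[i, j] = b(i, j) * 2              # S1
            c[i, j] = a[i - 1, j + 1] / 2      # S2
    return c

def program_13b():
    a, c = old_a(), {}
    for i in range(1, N + 1):                  # S1 loop
        for j in range(1, N + 1):
            a[i, j] = b(i, j) * 2
    for i in range(1, N + 1):                  # S2 loop
        for j in range(1, N + 1):
            c[i, j] = a[i - 1, j + 1] / 2
    return c

def program_13c():
    a, c = old_a(), {}
    for ip in range(1, N // RZ + 1):
        ilb, iub = 1 + (ip - 1) * RZ, ip * RZ
        for jp in range(1, N // RZ + 1):
            jlb, jub = 1 + (jp - 1) * RZ, jp * RZ
            for i in range(ilb, iub + 1):      # S1 on one page
                for j in range(jlb, jub + 1):
                    a[i, j] = b(i, j) * 2
            for i in range(ilb, iub + 1):      # S2 on the same page
                for j in range(jlb, jub + 1):
                    c[i, j] = a[i - 1, j + 1] / 2
    return c
```

With these values, Programs 13-a and 13-b both assign C(2, 8) = 109.0, computed from the new A(1, 9) = 218, whereas Program 13-c assigns C(2, 8) = -54.5, computed from the old A(1, 9) = -109.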

To simplify our discussion of this subject we will consider only
a basic π-block with only one assignment statement. This will be of the
form:

Program 14-a.

         DO S I1 = 1, N
         DO S I2 = 1, N
S        A(F1(I1), F2(I2)) = A(f1(I1), f2(I2)) + <an expression not
                             containing references to A>

where F1(I1) and f1(I1) are linear functions of I1. Similarly F2(I2)
and f2(I2) are linear functions of I2. At the end of this section we
will discuss extending our analysis and theorems to cover more general
cases.

The problem here is to find necessary and sufficient conditions
for the correctness of page indexing Program 14-a, i.e., we want to test
whether the following program will produce identical results to those
produced by Program 14-a:

Program 14-b.

      DO S IP1 = 1, ⌈N/RZ⌉
      ILB1 = 1 + (IP1-1)*RZ
      IUB1 = MIN(IP1*RZ, N)
      DO S IP2 = 1, ⌈N/RZ⌉
      ILB2 = 1 + (IP2-1)*RZ
      IUB2 = MIN(IP2*RZ, N)
      DO S I1 = ILB1, IUB1
      DO S I2 = ILB2, IUB2
S     A(F1(I1), F2(I2)) = A(f1(I1), f2(I2)) + ...

Figure 20-a shows the I1×I2 plane. Each point (i1, i2) in this
plane can be associated with the execution of the statement S when I1 =
i1 and I2 = i2. One can imagine a cursor that moves from one point to
another in the I1×I2 plane as S is executed with the index variables
taking the values of the coordinates of the first point, then executed
with the index variables taking the values of the coordinates of the
second point, etc. Thus the cursor will trace a particular curve in the
I1×I2 plane during the execution of S (actually it will visit discrete
points on the curve). Figure 20-a shows the curve traced by the cursor
when Program 14-a is executed. In the figure N = 8.

Figure 20-a. The Curve Traced by the Cursor in the I1×I2 Plane when
Program 14-a Is Executed.

Figure 20-b. The Curve Traced in the I1×I2 Plane when Program 14-b Is
Executed.

If the cursor passes through the point P with the coordinates
(i1, i2) before the point P' with the coordinates (i1', i2') we will say
that P precedes P' and denote this by P < P' or (i1, i2) < (i1', i2').

Figure  20-b  shows  the  curve  traced  by  the  cursor  when  Program 
14-b  is  executed.   In  the  figure  RZ  =  2. 
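
The two cursor orders can be enumerated directly. The Python sketch
below is only an illustration of the traces described for Figures 20-a
and 20-b; the function names are hypothetical, and the blocks are taken
to be RZ x RZ as in Program 14-b.

```python
from math import ceil


def order_14a(n):
    """Cursor order for Program 14-a: I1 is the outer index, I2 the inner."""
    return [(i1, i2) for i1 in range(1, n + 1) for i2 in range(1, n + 1)]


def order_14b(n, rz):
    """Cursor order for Program 14-b: RZ x RZ blocks visited one at a time."""
    points = []
    blocks = ceil(n / rz)
    for ip1 in range(1, blocks + 1):
        ilb1, iub1 = 1 + (ip1 - 1) * rz, min(ip1 * rz, n)
        for ip2 in range(1, blocks + 1):
            ilb2, iub2 = 1 + (ip2 - 1) * rz, min(ip2 * rz, n)
            for i1 in range(ilb1, iub1 + 1):
                for i2 in range(ilb2, iub2 + 1):
                    points.append((i1, i2))
    return points


def precedes(order, p, q):
    """True if the cursor passes through p before q (the relation P < P')."""
    return order.index(p) < order.index(q)
```

For N = 8 and RZ = 2, the point (1, 3) precedes (2, 1) in the first
order but not in the second, which is the kind of reordering that can
break a dependence.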

According to the execution sequencing of S in Program 14-a, if
OUT(S(i1', i2')) ∈ IN(S(i1, i2)) and (i1', i2') < (i1, i2) then there is a
dependence vector, V = (i1, i2)(i1', i2') from point P' to point P in the
I1×I2 plane. In general, if there are references to several different
elements of A on the right-hand side of S, there might be several
dependence vectors from several points, P', P'', P''', ... to the point P.
The points, P', P'', ... are called the source points of these dependence
vectors and the point P is the destination point. The page indexing
transformation will be correct if and only if, for all computed points,
the cursor will pass through all the dependence source points of each
given point before it passes through the point itself.

As an example consider the program:

Program 15.

      DO 10 I1 = 1, 4
      DO 10 I2 = 1, 4
10    A(I1 + 1, I2) = A(5-I1, 5-I2)

Table 6 lists the points visited by the cursor, their coordinates in the
I1×I2 plane, OUT(S(I1, I2)), and IN(S(I1, I2)). Examining the table we
find the following dependence vectors:

Table 6. The Points on the Execution Trace of Program 15.

Figure 21 shows these dependence vectors in the I1×I2 plane. If this
program is page indexed for a page size of 4 the cursor will visit the
points of the I1×I2 plane in the following order:

P1, P2, P5, P6, P3, P4, P7, P8, P9, P10, P13, P14, P11, P12, P15, P16

We note that the source point of every dependence vector is visited before
its destination point. Thus the page indexing transformation is valid for
a page size of 4. The transformation will not be valid, however, for a
page size of 9. In this case P9 will be visited before P4.

Figure 21. Dependence Vectors for Program 15

We present next a theorem to be used in testing the validity of the
page indexing transformation for all page sizes.

Theorem 3.2 For the program:

      DO S I1 = 1, N
      DO S I2 = 1, N
S     A(F1(I1), F2(I2)) = A(f1(I1), f2(I2)) + <an expression not
                                               containing references to A>

let F1, f1 be linear functions of I1 and F2, f2 be linear functions of I2.
Moreover, let

      Y1 = {F1(1), F1(2), ..., F1(N)}
      y1 = {f1(1), f1(2), ..., f1(N)}
      Y2 = {F2(1), F2(2), ..., F2(N)}
      y2 = {f2(1), f2(2), ..., f2(N)}

Then the page indexing transformation cannot be applied to this program
if and only if both of the following conditions are true:

C1:
      T1 = Y1 ∩ y1 ≠ ∅, where

      T1 = {F1(k_11), F1(k_12), ..., F1(k_1m)}
         = {f1(k_21), f1(k_22), ..., f1(k_2m)}

      and there exist k_1p and k_2p, 1 ≤ p ≤ m, such that k_1p < k_2p.
      Note that F1(k_1p) = f1(k_2p) ∈ T1.

C2:
      T2 = Y2 ∩ y2 ≠ ∅, where

      T2 = {F2(j_11), F2(j_12), ..., F2(j_1l)}
         = {f2(j_21), f2(j_22), ..., f2(j_2l)}

      and there exist j_1q and j_2q, 1 ≤ q ≤ l, such that j_1q > j_2q.
      Note that F2(j_1q) = f2(j_2q) ∈ T2.

Proof:

The theorem states that the combined condition C = C1·C2 is a
necessary and sufficient condition for the page indexing transformation
not to be valid. This is equivalent to saying that the page indexing
transformation is valid if and only if C1 is not true or C2 is not true.

If T1 is an empty set then C1 will not be true. Note that since
F1 and f1 are functions only of I1, the program will write in a
particular row of A using only points from a single row. Thus if T1 = ∅,
then when the program is writing in a row of A it will read values from
points on a row which was never (and will never be) written into. Thus
there will be no data dependence vectors between any two points of the
I1×I2 plane. This means that the cursor can visit the points in the I1×I2
plane in any order, and hence the page indexing transformation will be
valid.

C2 will not be satisfied if T2 is empty. By an argument similar
to the one presented in the previous paragraph, if T2 = ∅ there will be
no data dependence vectors between any two points of the I1×I2 plane.
Thus the transformation will be valid.

If T1 ≠ ∅ and T2 ≠ ∅ then dependence vectors may exist. Consider
Figure 22. When the cursor is at point P2 = (k_2p, j_2q) (i.e., the
program is assigning a value to A(F1(k_2p), F2(j_2q))), the value of
A(f1(k_2p), f2(j_2q)) will be used on the right-hand side of S. If
f1(k_2p) ∈ T1 and f2(j_2q) ∈ T2, then there must exist k_1p such that
F1(k_1p) = f1(k_2p) and j_1q such that F2(j_1q) = f2(j_2q). Thus there
will exist a vector from the point P1 = (k_1p, j_1q) to the point
P2 = (k_2p, j_2q). Let θ be the angle between the vector drawn from
P1 to P2 and the I2 direction. As shown in Figure 22, θ can take any value
between 0° and 360°. From our previous description of the manner in which
the cursor will travel in the I1×I2 plane when page indexing is used (see
Figure 20-b) we conclude the transformation will be valid for all page
sizes if and only if 0° ≤ θ ≤ 90° or 180° ≤ θ ≤ 360°. In other words, the
transformation will not be valid if and only if 90° < θ < 180°.

Figure 22. Dependence Vectors in the I1×I2 Plane

For 90° < θ < 180°, sin θ > 0 and hence k_2p - k_1p > 0. Moreover
cos θ < 0 and hence j_2q - j_1q < 0.

Q.E.D.
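
For small N the two conditions can be evaluated by direct enumeration,
which is convenient for checking examples by hand. The following Python
sketch is an illustration, not the test proposed in the text: the
function names are hypothetical and the subscript functions are passed
in as ordinary callables.

```python
def condition_c1(F1, f1, n):
    """C1: some k1 < k2 in 1..n with F1(k1) == f1(k2)."""
    return any(F1(k1) == f1(k2)
               for k1 in range(1, n + 1)
               for k2 in range(k1 + 1, n + 1))


def condition_c2(F2, f2, n):
    """C2: some j1 > j2 in 1..n with F2(j1) == f2(j2)."""
    return any(F2(j1) == f2(j2)
               for j2 in range(1, n + 1)
               for j1 in range(j2 + 1, n + 1))


def page_indexing_may_fail(F1, f1, F2, f2, n):
    """Theorem 3.2: the transformation cannot be applied iff C1 and C2 hold."""
    return condition_c1(F1, f1, n) and condition_c2(F2, f2, n)
```

For instance, a statement of the form A(I1+1, I2) = A(I1, I2) satisfies
C1 but not C2, so the theorem allows page indexing, whereas reading
Program 15 as A(I1+1, I2) = A(5-I1, 5-I2) (an assumption about the
garbled listing) satisfies both conditions.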


Since F1 and f1 are linear functions of I1 and similarly F2 and
f2 are linear functions of I2, the following theorem can be used to test
whether condition C1 or C2 of Theorem 3.2 is satisfied.


Theorem 3.3 [BANE78]:

Given the two functions

      f(I) = α + aI and
      g(J) = β + bJ

where α, β, a, b are integer constants (a, b ≠ 0), and I and J are
integer variables such that 1 ≤ I ≤ N and 1 ≤ J ≤ N, then the two sets
{f(1), ..., f(N)} and {g(1), ..., g(N)} intersect and there will be at
least two integers i, j such that f(i) = g(j) with i < j, if and only if
the following conditions are satisfied.

(A) gcd(a, b) = d divides β - α; and

(B) ⌈max U(i0, j0)⌉ ≤ ⌊min V(i0, j0)⌋

where

(i)   gcd(a, b) is the greatest common divisor of a and b.
(ii)  (i = i0, j = j0) is any solution to the equation

            ai - bj = β - α

(iii) the two sets U = U(i0, j0) and V = V(i0, j0) are defined
      as follows:

      (1 - i0)*d/b is in U if b > 0, in V if b < 0;
      (N - i0)*d/b is in U if b < 0, in V if b > 0;
      (1 - j0)*d/a is in U if a > 0, in V if a < 0;
      (N - j0)*d/a is in U if a < 0, in V if a > 0;
      (i0 - j0 + 1)*d/(a-b) is in U if a > b, in V if a < b.

Proof: see [BANE78]

We now illustrate the use of Theorem 3.3 in testing C1 and C2 of
Theorem 3.2. Consider the following program:

      DO S I1 = 1, 9
      DO S I2 = 1, 9
S     A(2*I1-1, I2+2) = A(I1+1, 10-I2)

We first check if C1 is true. Thus we test whether the two functions

      f(I) = 2I - 1 = aI + α and
      g(J) = J + 1 = bJ + β

will intersect and whether there is some i and j such that f(i) = g(j)
and i < j. The gcd(a, b) = 1 and β - α = 2. Thus gcd(a, b) divides β - α.
A particular solution to the equation 2I - J = 2 is i0 = 10 and j0 = 18.
U is the set {-9, -8.5, -7} and the set V is {-1, -4.5}.
⌈max U(i0, j0)⌉ = -7 and ⌊min V(i0, j0)⌋ = -5. Hence condition (B) is
satisfied.

Thus C1 is true of this program. Now we test whether C2 also
holds. Thus we test whether the two functions

      f(I) = -I + 10 = aI + α and
      g(J) = J + 2 = bJ + β

intersect at some I = i and J = j, such that i < j.
Here we have

      α - β = 8
      gcd(a, b) = 1

Hence condition (A) is satisfied. A particular solution to the equation
-I - J = -8 is i0 = 2 and j0 = 6. Thus we have:

      U = {-1, -3},    ⌈max U(i0, j0)⌉ = -1
      V = {7, 5, 3/2}, ⌊min V(i0, j0)⌋ = 1

Hence condition (B) is also satisfied and C2 holds for this program.
Since both C1 and C2 are true, page indexing cannot be applied to this
program.
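
The test of Theorem 3.3 is easy to mechanize. The Python sketch below is
one possible rendering, not the implementation used in PARAFRASE: the
function names are hypothetical, the particular solution is obtained
with the extended Euclidean algorithm (so it generally differs from the
i0, j0 chosen by hand above, which the theorem permits), and a
brute-force search is included only for cross-checking.

```python
from fractions import Fraction
from math import ceil, floor


def ext_gcd(a, b):
    """Return (d, x, y) with d = gcd(a, b) > 0 and a*x + b*y == d."""
    old_r, r, old_x, x, old_y, y = a, b, 1, 0, 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    if old_r < 0:
        old_r, old_x, old_y = -old_r, -old_x, -old_y
    return old_r, old_x, old_y


def meets_with_i_before_j(a, alpha, b, beta, n):
    """Theorem 3.3 for f(I) = a*I + alpha and g(J) = b*J + beta on 1..n.

    Assumes a != 0, b != 0 and a != b, as the theorem does.
    """
    d, x, y = ext_gcd(a, b)
    c = beta - alpha
    if c % d != 0:                        # condition (A) fails
        return False
    i0, j0 = x * (c // d), -y * (c // d)  # a*i0 - b*j0 == beta - alpha
    u, v = [], []                         # the sets U and V

    def place(value, goes_to_u):
        (u if goes_to_u else v).append(value)

    place(Fraction((1 - i0) * d, b), b > 0)
    place(Fraction((n - i0) * d, b), b < 0)
    place(Fraction((1 - j0) * d, a), a > 0)
    place(Fraction((n - j0) * d, a), a < 0)
    place(Fraction((i0 - j0 + 1) * d, a - b), a > b)
    return ceil(max(u)) <= floor(min(v))  # condition (B)


def brute_force(a, alpha, b, beta, n):
    """Direct search for i < j with f(i) == g(j); for cross-checking only."""
    return any(a * i + alpha == b * j + beta
               for i in range(1, n + 1) for j in range(i + 1, n + 1))
```

Called with (2, -1, 1, 1, 9) and with (-1, 10, 1, 2, 9), the two cases
worked above, the test reports that both C1 and C2 hold.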


In Theorem 3.3 we assumed that a ≠ 0, b ≠ 0, and a-b ≠ 0. The
conditions to be tested are simple if these assumptions do not hold.
For example, if a ≠ 0 and b = 0 then the two functions f(I) = aI + α
and g(J) = β will intersect if and only if (β - α)/a is an integer between
1 and N. In the case where a = b ≠ 0 the two functions will intersect
if (α - β)/b is an integer between 1 and N. For this case the two functions
f(I) = bI + α and g(J) = bJ + β will intersect at the points (I = i,
J = i + (α - β)/b), i = 1, 2, ..., N - (α - β)/b.

If different elements of the array A are referenced in the right-
hand side of the statement S in the NP under consideration, then we use
Theorems 3.2 and 3.3 to determine whether C1 and C2 hold between the
subscript expression of the output variable A(F1(I1), F2(I2)) and the
subscript expressions of every reference to a different element of A on
the right-hand side of S. If the π-block has more than one statement then
we do the testing between the set of multi-dimensional array output
variables and all references to different elements of these arrays in
the set of input variables of the π-block. Note that for the π-block:

      DO Sm I1 = 1, N1
      DO Sm I2 = 1, N2
S1
S2
      .
      .
Sm

the set of output variables is

      ∪(k=1..m) OUT(Sk(I1, I2))

and the set of input variables is given by:

      ∪(k=1..m) [IN(Sk(I1, I2)) - ∪(l=1..k-1) OUT(Sl(I1, I2))]

If the basic NP has several π-blocks we must do the testing
between the set of multi-dimensional output variables of the NP and their
occurrences in its set of input variables.

3.5.3 Transforming Nonbasic π-Blocks into Basic π-Blocks

The page indexing algorithm does not achieve its goals if applied
to nonbasic π-blocks. As an example consider the following program:

Program 16-a.

DI    DO S2 I = 1, N
S1    B(I,1) = A(I,1)**.5
DJ    DO S2 J = 1, N
S2    A(I+1, J) = B(I, J) + C(I, J)

If we apply the page indexing algorithm as described in the
previous section to Program 16-a, we get the following program:

Program 16-b.

      DO 10 IP = 1, ⌈N/RZ⌉
      ILB = 1 + (IP-1)*RZ
      IUB = MIN(IP*RZ, N)
      DO 10 I = ILB, IUB
S1    B(I,1) = A(I,1)**.5
      DO 10 JP = 1, ⌈N/RZ⌉
      JLB = 1 + (JP-1)*RZ
      JUB = MIN(JP*RZ, N)
      DO 10 J = JLB, JUB
S2    A(I+1,J) = B(I,J) + C(I,J)
10    CONTINUE

We note that the index sequencing of Program 16-b is identical to that
of Program 16-a. The advantages of page indexing, i.e., making the
maximum number of references to a page while it is in main memory, are
not achieved.

Any nonbasic π-block, however, can be changed to a basic one
by expanding the scope of some of its DO statements to make all assignment
statements fall at the same nest depth level. Of course some of these
statements must now be executed conditionally. For example the scope
of the DJ statement in Program 16-a can be expanded to include S1. The
resulting basic π-block is shown below.

Program 16-c.

DI    DO S2 I = 1, N
DJ    DO S2 J = 1, N
S1    IF (J.EQ.1) B(I,1) = A(I,1)**.5
S2    A(I+1,J) = B(I,J) + C(I,J)

The  page  indexing  transformation  will  now  be  effective.   This  is  shown 
below. 

Program 16-d.

      DO S2 IP = 1, ⌈N/RZ⌉
      ILB = 1 + (IP-1)*8
      IUB = MIN(IP*8, N)
      DO S2 JP = 1, ⌈N/RZ⌉
      JLB = 1 + (JP-1)*8
      JUB = MIN(JP*8, N)
      DO S2 I = ILB, IUB
      DO S2 J = JLB, JUB
S1    IF (J.EQ.JLB.AND.JP.EQ.1) B(I,1) = A(I,1)**.5
S2    A(I+1,J) = B(I,J) + C(I,J)

Note the modification in the IF statement.
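
That the guarded, page-indexed form computes the same values as the
original nonbasic block can be checked by simulation. The Python sketch
below is an illustration only: the test data are arbitrary positive
numbers (an assumption, so that the square root is defined), and the
block side is passed as a parameter `rz` rather than fixed at 8.

```python
from math import ceil, sqrt


def make_arrays(n):
    # 1-based use; row n+1 of A exists because S2 writes A(I+1, J).
    a = [[float(i + j + 1) for j in range(n + 2)] for i in range(n + 2)]
    b = [[float(2 * i + j) for j in range(n + 2)] for i in range(n + 2)]
    c = [[float(i * j + 1) for j in range(n + 2)] for i in range(n + 2)]
    return a, b, c


def program_16a(n):
    a, b, c = make_arrays(n)
    for i in range(1, n + 1):
        b[i][1] = sqrt(a[i][1])                  # S1
        for j in range(1, n + 1):
            a[i + 1][j] = b[i][j] + c[i][j]      # S2
    return a, b


def program_16d(n, rz):
    a, b, c = make_arrays(n)
    blocks = ceil(n / rz)
    for ip in range(1, blocks + 1):
        ilb, iub = 1 + (ip - 1) * rz, min(ip * rz, n)
        for jp in range(1, blocks + 1):
            jlb, jub = 1 + (jp - 1) * rz, min(jp * rz, n)
            for i in range(ilb, iub + 1):
                for j in range(jlb, jub + 1):
                    if j == jlb and jp == 1:
                        b[i][1] = sqrt(a[i][1])          # S1, guarded
                    a[i + 1][j] = b[i][j] + c[i][j]      # S2
    return a, b
```

Both functions apply the same operations to the same operands, so their
results can be compared for exact equality, e.g. `program_16a(20) ==
program_16d(20, 8)`.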

We now discuss a general algorithm to transform any π-block
structure into a basic structure. Let the set of DO statements in the
π-block be Dπ = {DI1, DI2, ..., DI_ND}, ND ≥ 1. The set of corresponding
index variables is denoted by {I1, I2, ..., I_ND}. For each DO statement
DIi we denote the lower bound of its index variable set by Li and the
upper bound by Ui. Let the set of non-DO statements in the π-block be
denoted by Sπ = {S1, S2, ..., Sm}, m ≥ 1. For each Si let

      DBi = {the set of DO statements that precede Si and
             whose scopes do not include Si}
          = {DI_b(i,1), DI_b(i,2), ..., DI_b(i,ki)}, 0 ≤ ki < ND.

Moreover, let

      DAi = {the set of DO statements that follow Si}
          = {DI_a(i,1), DI_a(i,2), ..., DI_a(i,si)}, 0 ≤ si < ND.

Then the π-block can be transformed to the form:

      DI1
      DI2
       .
       .
      DI_ND
      S1.C1
      S2.C2
       .
       .
      Sm.Cm

where Ci is a Boolean variable which controls the execution of Si. If
Ci is true then Si is executed, else it is not. Ci is given by:

      Ci = {(I_b(i,1) = U_b(i,1)).AND.(I_b(i,2) = U_b(i,2)).AND. ...
            .AND.(I_b(i,ki) = U_b(i,ki)).AND.
            (I_a(i,1) = L_a(i,1)).AND.(I_a(i,2) = L_a(i,2)).AND. ...
            .AND.(I_a(i,si) = L_a(i,si))}

To illustrate the application of this algorithm consider the
Gaussian elimination program shown below:

Program 17-a.

DI1   DO S2 I1 = 1, N-1
DI2   DO S1 I2 = (I1+1), N
S1    A(I2,I1) = A(I2,I1)/A(I1,I1)
DI3   DO S2 I3 = (I1+1), N
DI4   DO S2 I4 = (I1+1), N
S2    A(I4,I3) = A(I4,I3) - A(I4,I1)*A(I1,I3)

Here we have:

      Dπ  = {DI1, DI2, DI3, DI4}
      Sπ  = {S1, S2}
      DB1 = ∅
      DA1 = {DI3, DI4}
      DB2 = {DI2}
      DA2 = ∅
      C1  = I3.EQ.(I1+1).AND.I4.EQ.(I1+1)
      C2  = I2.EQ.N

Thus the corresponding basic π-block is as follows:

Program 17-b.

DI1   DO S2 I1 = 1, N-1
DI2   DO S2 I2 = (I1+1), N
DI3   DO S2 I3 = (I1+1), N
DI4   DO S2 I4 = (I1+1), N
S1    IF (I3.EQ.(I1+1).AND.I4.EQ.(I1+1))
     *   A(I2,I1) = A(I2,I1)/A(I1,I1)
S2    IF (I2.EQ.N)
     *   A(I4,I3) = A(I4,I3) - A(I4,I1)*A(I1,I3)

We note that this algorithm will introduce a large amount of
control instructions when the π-block is executed. This excessive
control can be reduced by fusing some of the loops in the π-block,
whenever possible, before expanding their scopes. Note that at this point
in the transformation process we know the data dependences in the
π-block and thus checking for the validity of loop fusion is a trivial
additional expense.
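
The construction of the sets DBi and DAi and of the guards Ci can be
sketched in a few lines of Python. This is an illustration, not the
PARAFRASE implementation: the tuple encoding of a π-block and the
function name `guards` are assumptions made here, and the guards are
produced as Fortran-style strings only for readability.

```python
def guards(block):
    """Build the guard C_i of every assignment statement in a pi-block.

    `block` lists the items in textual order: ("DO", name, index, lower,
    upper, last_label) for a DO statement whose scope ends at the statement
    labelled last_label, or ("S", label) for an assignment statement.
    """
    position = {item[1]: k for k, item in enumerate(block) if item[0] == "S"}
    result = {}
    for k, item in enumerate(block):
        if item[0] != "S":
            continue
        terms = []
        for m, other in enumerate(block):
            if other[0] != "DO":
                continue
            _, _, index, lower, upper, last_label = other
            if m < k and position[last_label] < k:   # member of DB_i
                terms.append(f"{index}.EQ.{upper}")
            elif m > k:                               # member of DA_i
                terms.append(f"{index}.EQ.{lower}")
        result[item[1]] = ".AND.".join(terms) if terms else ".TRUE."
    return result


PROGRAM_17A = [
    ("DO", "DI1", "I1", "1", "N-1", "S2"),
    ("DO", "DI2", "I2", "(I1+1)", "N", "S1"),
    ("S", "S1"),
    ("DO", "DI3", "I3", "(I1+1)", "N", "S2"),
    ("DO", "DI4", "I4", "(I1+1)", "N", "S2"),
    ("S", "S2"),
]

PROGRAM_16A = [
    ("DO", "DI", "I", "1", "N", "S2"),
    ("S", "S1"),
    ("DO", "DJ", "J", "1", "N", "S2"),
    ("S", "S2"),
]
```

Applied to `PROGRAM_17A` the sketch yields the guards C1 and C2 listed
for Program 17-a, and applied to `PROGRAM_16A` it yields the test
J.EQ.1 used in Program 16-c.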

The combined loop expansion-fusion transformation can be
applied to Program 17-a in the following steps:

(Expand DI3)

Program 17-c.

DI1   DO S2 I1 = 1, (N-1)
DI3   DO S2 I3 = (I1+1), N
      IF (I3.EQ.I1+1)
DI2      DO S1 I2 = (I1+1), N
S1       A(I2,I1) = A(I2,I1)/A(I1,I1)
DI4   DO S2 I4 = (I1+1), N
S2    A(I4,I3) = A(I4,I3) - A(I4,I1)*A(I1,I3)

(Fuse DI2 and DI4)

Program 17-d.

DI1   DO S2 I1 = 1, (N-1)
DI3   DO S2 I3 = (I1+1), N
DI2   DO S2 I2 = (I1+1), N
S1    IF (I3.EQ.I1+1) A(I2,I1) = A(I2,I1)/A(I1,I1)
S2    A(I2,I3) = A(I2,I3) - A(I2,I1)*A(I1,I3)

Thus in general, the nonbasic to basic π-block transformation
consists of a series of loop expansion and fusion steps. One starts by
trying to fuse loops in the given π-block. This is followed by expanding
the scope of the farthest reaching DO statement (if we associate
a CONTINUE statement with each DO statement and number these CONTINUE
statements sequentially, then the farthest reaching loop is the one
associated with the CONTINUE statement with the largest label). This
process of fusion followed by expansion is continued until a basic
π-block structure is reached. Note that to expand a loop we use the
algorithm presented previously in this section.
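
The equivalence of the original elimination loop, its fully expanded
form, and the expanded-and-fused form can be checked on a small matrix.
The Python sketch below is an illustration only: it uses exact rational
arithmetic, and the test matrix is an arbitrary diagonally dominant one
(an assumption made here so that no pivot is zero).

```python
from fractions import Fraction


def matrix(n):
    # Strictly diagonally dominant test data, 1-based (row/column 0 unused).
    return [[Fraction(0)] * (n + 1)] + [
        [Fraction(0)] + [Fraction(10 * n if i == j else (i * j) % 7)
                         for j in range(1, n + 1)]
        for i in range(1, n + 1)]


def program_17a(n):
    a = matrix(n)
    for i1 in range(1, n):
        for i2 in range(i1 + 1, n + 1):
            a[i2][i1] /= a[i1][i1]                           # S1
        for i3 in range(i1 + 1, n + 1):
            for i4 in range(i1 + 1, n + 1):
                a[i4][i3] -= a[i4][i1] * a[i1][i3]           # S2
    return a


def program_17b(n):
    a = matrix(n)
    for i1 in range(1, n):
        for i2 in range(i1 + 1, n + 1):
            for i3 in range(i1 + 1, n + 1):
                for i4 in range(i1 + 1, n + 1):
                    if i3 == i1 + 1 and i4 == i1 + 1:        # C1
                        a[i2][i1] /= a[i1][i1]
                    if i2 == n:                              # C2
                        a[i4][i3] -= a[i4][i1] * a[i1][i3]
    return a


def program_17d(n):
    a = matrix(n)
    for i1 in range(1, n):
        for i3 in range(i1 + 1, n + 1):
            for i2 in range(i1 + 1, n + 1):
                if i3 == i1 + 1:
                    a[i2][i1] /= a[i1][i1]                   # S1, guarded
                a[i2][i3] -= a[i2][i1] * a[i1][i3]           # S2
    return a
```

The four-deep nest of the second function executes its guards on every
iteration, which is the control overhead noted above; the fused form
needs only one guard in a three-deep nest.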

For Program 17-d the page indexing transformation can now be
applied as shown below. (This is a legal Fortran version. Also note that
we have substituted K for I1, J for I3, and I for I2.)

Program 17-e.

      RZ = Z ** .5
      NP = ⌈N/RZ⌉
      DO S2 KP = 1, NP
      KLB = 1 + (KP-1)*RZ
      DO S2 JP = KP, NP
      JLB = 1 + (JP-1)*RZ
      JUB = JP*RZ
      DO S2 IP = KP, NP
      ILB = 1 + (IP-1)*RZ
      IUB = IP*RZ
      IF (IP.EQ.KP) KUB = KP*RZ - 1
      IF (IP.NE.KP) KUB = KP*RZ
      DO S2 K = KLB, KUB
      IF (IP.EQ.KP) ILB = K + 1
      IF (JP.EQ.KP) JLB = K + 1
      DO S2 J = JLB, JUB
      DO S2 I = ILB, IUB
S1    IF (J.EQ.JLB.AND.JP.EQ.KP) A(I,K) = A(I,K)/A(K,K)
S2    IF (J.LE.JUB) A(I,J) = A(I,J) - A(I,K)*A(K,J)

4. EXPERIMENTAL RESULTS

The aim of this chapter is to provide some preliminary
experimental evidence of the usefulness of the transformations presented
in Chapter Three. We will also discuss some experiments which we performed
to investigate the concept of bounded locality intervals [BATS76a] and
the correlation between a program's syntactic structure and its BLI's.

We have chosen 17 Fortran IV programs to experiment with. There
were two reasons to select programs written in Fortran and not in other
languages. First, there are a large number of all kinds of Fortran
programs available for experimentation. Second, the current version of the
PARAFRASE compiler accepts only Fortran programs. We think of the
transformations presented in Chapter Three as modifications and extensions
to some of the transformations already existing in the PARAFRASE compiler
in addition to some new ones which are specifically aimed at the
enhancement of the performance of virtual memory systems.

Eleven of our programs were chosen from a collection of programs
which we got from different national laboratories. In the other six
programs we coded some standard matrix algorithms. In selecting the eleven
programs we followed two guidelines. First, we wanted a set of programs
which was fairly representative of various numerical Fortran programs.
We wanted the complexity of the calculations performed in the programs
to vary from simple or merely data movement operations to complex
computations. Second, we eliminated any programs which have relatively
small memory requirements. We required that each of the chosen programs
have a virtual address space of more than twenty pages.

We have chosen the page size to be 256 bytes (64 words). For
our purposes, the choice of the page size is not critical. We are
trying to demonstrate that programs which reference multi-page arrays,
irrespective of the size of one page, can be transformed to behave
better in a paged virtual memory environment. At the end of this
chapter we will discuss the effect of varying the page size on our
results. We will show that the effectiveness of our transformations is
rather independent of the page size. For our purposes, what matters is
not the absolute value of the size of pages and the sizes of arrays but
their relative sizes. Since we are mostly interested in programs which
have large virtual space (these are the programs which usually can
have disastrous behavior in virtual memory machines) a page size of
256 bytes seemed to be suitable to ensure that our collection of
programs have large space requirements. As mentioned earlier we will
return to this subject in much more detail at the end of this chapter.

Table 7 shows a brief description of the programs used in our
experiments. The total number of source cards (excluding comments) is
1598. The total number of DO statements is 200. We generate the trace
of a program using the arrangement shown in Figure 23. The input Fortran
program is passed through the scanner of the PARAFRASE compiler and the
IBM Fortran IV G1 level 2.0 compiler. The output of the Fortran compiler
is a listing showing every statement of the source program and the
portion of the object code associated with it. We examine this output and
make a list of the statement numbers of those statements which must be
executed by the trace generator. These include any statements which
calculate index variables, loop bounds, or conditions of logical IF
statements. The trace generator receives the output of the Fortran
compiler, the program description tables from the PARAFRASE scanner,
and a control input which includes the list of statements to be executed,
specification of the storage scheme of multi-dimensional arrays (storage
by rows, columns, or square blocks), the page size in words, necessary
values for some variables used in the input program, and branching
probabilities to be used in those IF statements for which the test
condition cannot be evaluated by the current trace generator. Thus the
trace generator will simulate a partial execution of the input program
which is sufficient to get an accurate trace of array references. The
branching probabilities and the values to be given to variables are
chosen with the help of the documentation of the input program or by
personal communication with the people who supplied the program.
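
How an array reference becomes a page reference depends on the storage
scheme named in the control input. The Python sketch below shows one
plausible mapping for the three schemes; it is not the trace generator
itself, and it assumes 1-based subscripts, one word per element, arrays
that start on a page boundary, and square blocks whose side is the
square root of the page size.

```python
from math import isqrt


def page_of(i, j, rows, cols, scheme, page_words=64):
    """Page number (from 0) holding element (i, j) of a rows x cols array."""
    if scheme == "rows":
        return ((i - 1) * cols + (j - 1)) // page_words
    if scheme == "columns":
        return ((j - 1) * rows + (i - 1)) // page_words
    if scheme == "blocks":
        rz = isqrt(page_words)             # side of one square block
        blocks_per_row = -(-cols // rz)    # ceiling division
        return ((i - 1) // rz) * blocks_per_row + (j - 1) // rz
    raise ValueError(f"unknown storage scheme: {scheme}")


def page_trace(points, rows, cols, scheme, page_words=64):
    """Translate a sequence of (i, j) references into page numbers."""
    return [page_of(i, j, rows, cols, scheme, page_words) for i, j in points]
```

With 64-word pages and a 48 x 48 array, for example, the 8 x 8 block
containing elements (1, 1) through (8, 8) falls on a single page under
the block scheme but is spread over several pages under storage by rows.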

On two occasions we had to eliminate a loop in a program. We
eliminated the following loop from the Fast Fourier Transform program,
FOURTR:

      DO S6 I = 1, N2N
S1    IIN1 = 1 + REVERS(I)
S2    IIN2 = 1 + REVERS(I+1)
S3    CR(I) = INR(IIN1) + INR(IIN2)
S4    CI(I) = INI(IIN1) + INI(IIN2)
S5    CR(I+1) = INR(IIN1) - INR(IIN2)
S6    CI(I+1) = INI(IIN1) - INI(IIN2)

This had to be done because the current trace generator cannot evaluate
statements S1 and S2, which is necessary to calculate some subscripts in
statements S3, S4, S5, and S6. The current trace generator does not
evaluate expressions if they contain array elements.

On the other occasion we eliminated a loop from the TWOWAY
program. This loop contained 211 statements with several inner loops at
different nest depth levels. Analyzing this program with this loop
included exceeded the capabilities of the current PARAFRASE compiler.

Our original plans for the experiments were to apply one
transformation at a time to each of our programs in order to measure the
contribution of each transformation to the total achieved improvement.
We decided to abandon these plans for the time being due to the enormous
amount of results which would be generated. Thus we applied all the
transformations possible to a given program in order to achieve the best
possible improvement. We used a mixture of automatic and manual means
for applying the transformations. The data dependence relations were
analyzed automatically. Part of the transformations were already
implemented in the PARAFRASE compiler. The clustering transformation has
been added to PARAFRASE and work is continuing to add the rest of the
transformations. To obtain our current results, whenever we had to, we
applied the transformations manually. We would like to emphasize that we
look at the experimental results reported here as preliminary results.
We decided that initially it is important to get a feeling for the amount
of improvement which can be achieved in the behavior of real programs
by transforming a few programs, using automatic and manual means, and
examining the results rather than waiting to fully automate the
transformations before generating any results. We feel that our preliminary
results serve as the green light which signals that the investment of
effort in automating all our techniques is a safe investment.

In Table 8 we compare some of the characteristics of the original
and transformed programs. This table is meant to give a feeling for the
worst possible cost of transforming a program. We will explain as our
discussion progresses why this is the worst cost of the transformations.
For 6 of the 17 programs the number of pages referenced in the transformed
program exceeds the number in the original program. This is due to the
scalar expansion transformation. We note that the maximum increase is
5 pages. We also notice an increase in the number of array references
for those programs where scalar expansion was used. This is not a real
increase in the number of memory references to data words in the
transformed program. These extra memory references reported for the
transformed program are also made in the original program, but to scalar
variables. For the original programs these references were just not
counted because we only count references to array elements. The increase
shown in the number of source statements in the transformed programs is
not really accurate. It is an overestimate. The reason for this is
that our current trace generator is not very smart and in many cases
we had to insert redundant statements to make the tracer do what it is
supposed to do. For example the current tracer cannot evaluate ILB in
the following statement:

      IF (KP.EQ.1) ILB = K+1

To achieve this assignment to ILB we do the following:

      IF (KP.NE.1) GO TO 1
      ILB = K + 1
1     ...

Moreover, our tracer does not evaluate functions. Thus to make the
assignment:

      IUB = MIN (N,IP*Z)

we do the following:

      IUB = IP*Z
      IF (IUB.LE.N) GO TO 2
      IUB = N
2     ...

These and other inefficiencies in our tracer lead also to an
overestimation of the increase in the number of instructions executed in
the transformed programs. The more pronounced increase in the number of
executed instructions for programs CD, LUD, and GE is mainly due to the
nonbasic to basic π-block transformation. Our current implementation
of this transformation introduces an appreciable amount of control
instructions. Further effort needs to be made to improve the
implementation of this transformation. In Chapter 5 we make some
suggestions concerning this point.

Our  experiments  fall  in  three  categories.   In  the  first  we 
implemented  the  algorithms  described  in  [BATS76a]  to  find  the  BLI's 
of  our  programs  and  their  transformed  versions.   The  purpose  of  these 
experiments  is  to  investigate  the  validity  of  the  BLI  concept  in  defining 
the  localities  of  a  program.   Moreover,  we  wanted  to  compare  the 
characteristics  of  the  localities  found  in  a  program  to  those  found  in 
its  transformed  version.   We  also  wanted  to  compare  our  findings  to  the 
experimental  results  reported  in  [BATS76a].   We  will  discuss  all  these 
issues  in  Section  4.1. 

In the second category of experiments we simulated the local
LRU memory management algorithm and generated the page-faults vs. memory
allotment and the space-time cost vs. memory allotment curves of every
program and its transformed version. The purpose is to compare the cost
of executing original and transformed programs under LRU. We have
chosen the LRU algorithm because it is known to be the best among the
heuristic replacement algorithms and because most of the existing
virtual memory machines use some sort of LRU algorithm for memory
management [SCHE73], [JONE72]. The results of these simulations are
discussed in Section 4.2.

The third category of experiments is designed to investigate
whether there is any merit in using variable memory allotment policies
rather than fixed memory allotment policies for the memory management of
transformed programs. We have chosen the working set management policy
as a representative of variable memory policies [DENN68]. We compared
the space-time cost for the transformed programs under the LRU and working
set policies. For several programs we encountered the real memory-fault
rate and parameter-real memory anomalies as described in [FRAN78]. This
point and the LRU-working set comparison will be discussed in Section 4.3.

In Section 4.4 we summarize the implications of our findings and
investigate the sensitivity of our results to the page size.

4.1 Measuring the Characteristics of Program Localities

To measure the characteristics of program localities one has
first to identify these localities. This can easily be done for the
transformed versions of our collection of programs because they follow
the ELM. In a transformed program, whenever a π-block is being executed,
the reference string will stay within one locality interval. The MTBR
to every page of this locality is small, O(R_p), where R_p is the number
of array references made per iteration of the innermost loop of the
π-block. The density of references to a page is high. Hence for
transformed programs one can identify localities and measure, for each
locality, the number of pages referenced and its duration.

Loops in untransformed programs do not in general follow the
ELM and hence it is not easy to identify localities and measure their
characteristics.

Thus one can measure the characteristics of localities in
transformed programs but cannot compare these measurements in an accurate
way to measurements made on the original programs. The localities of
original programs are simply not well defined! The locality of reference
of untransformed programs is a vague, loose, and unquantifiable concept.

The work of Batson and Madison [BATS76a], [BATS76b] is the
only effort previously made to identify localities in reference strings
of programs. In Chapter 2 we have shown that there are several problems
with the concept of BLI's as developed in [BATS76a]. We confirmed the
existence of these problems by implementing Batson's algorithms and
finding the BLI's of our programs. We then correlated the BLI structure
of a program to its syntactic structure. We made assumptions which are
identical to those made by Batson. He assumed that there is a one-to-one
correspondence between array names and segment identifiers. In other
words, he assumed a segmented virtual memory system in which the segment
size can vary. Each array, irrespective of its size, is stored in a
single segment.

After using the BLI's generated for our programs to investigate
the correctness of the BLI concept and find its problems, we decided to
use the resulting data for other purposes. If meaningless and misleading
BLI's are discarded one can identify those BLI's that correspond to
loops. Thus by carefully examining the BLI's of a transformed program
one can find the duration of execution of each π-block and the number
of referenced arrays. This gives the size and lifetime of true
localities in transformed programs. For the untransformed programs we
identified the BLI's corresponding to outermost loops and recorded their
duration and number of referenced arrays. Our findings will be discussed
in Section 4.1.1.

We used the same techniques discussed in the previous
paragraph to collect data about the size and duration of localities for
paged virtual memory systems. In this case an array, depending on its
size, will span several 256-byte pages. In a transformed program, when
a π-block is executed, one locality set of pages will be referenced after
another. We collected data about the size and duration of localities of
a program by carefully examining its BLI's when generated under a paged
system assumption. For the untransformed programs we collected data
about the number of pages referenced in BLI's that correspond to
outermost loops. Our findings are discussed in Section 4.1.2.
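
To make the paged case concrete, the sketch below estimates how many pages an array spans. It assumes 4-byte words, so that a 256-byte page holds 64 words (the page size stated in Section 4.1.2), and it assumes that every array starts on a page boundary. The alignment assumption and the helper name `pages_spanned` are ours and are not taken from the experiments.

```python
import math

PAGE_BYTES = 256
WORD_BYTES = 4  # assumption: 64 words fit in one 256-byte page


def pages_spanned(*dims):
    """Number of pages covered by an array with the given dimensions,
    assuming one word per element and page-aligned storage."""
    words = math.prod(dims)
    return math.ceil(words * WORD_BYTES / PAGE_BYTES)
```

Under these assumptions a 16 by 16 array, such as those used in Program 18-a, occupies 4 pages, while it would count as a single segment in the segmented case.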


4.1.1  Localities  in  Segmented  Systems 

Because of the kind of segmented system we have assumed in this
section, we do not include any data from programs CD, FLR, GE, LUD,
MATMUL, and MATTRP. Including data from these programs would have biased
our findings towards localities of small sizes. In each of the programs
MATTRP, LUD, CD, and GE only one segment is referenced. In FLR two
segments are referenced and in MATMUL three segments are referenced.
In selecting programs for his experiments Batson rejected any programs
which reference fewer than six arrays. In Table 9 we compare some of
the characteristics of his programs and our programs (excluding the six
previously mentioned). We note that our programs have, on the average,
fewer arrays. Hence the locality of our untransformed programs is
slightly better than that of Batson's programs. Thus the improvement
results which will be reported are on the conservative side. The results
would have been even better if we had used Batson's collection or programs
with more arrays. This fact is emphasized by Figure 25, which will be
discussed shortly.


Table 9. Comparing Some Characteristics of Our Programs and Those
Used by Batson and Madison.

                                   Our Programs    Batson and Madison
                                                   Programs

Number of Arrays Referenced
in a Program
    Minimum                              6                   6
    Average                             24.3                26.1
    Maximum                             57                 127

Size of the Reference Strings
    Minimum                         11 152               5 459
    Average                         71 651              42 857
    Maximum                        236 027             102 227

Before discussing our results we make one more remark. Our
programs were transformed with the assumption that they will run in a
paged system with a page size of 256 bytes. However, it is not
difficult to deduce the characteristics of the localities if the programs
were transformed to run on a variable segment size system. The only
thing we have to do is to eliminate the effect of the page indexing
transformation on the generated data. For example consider the following
program:

Program 18-a.

      DO 10 I = 1,16
      DO 10 J = 1,16
      A(I,J) = B(I,J) + C(I,J)
   10 D(I,J) = B(I,J)/2

For a variable segment size virtual memory system this program will
be transformed as follows:

Program 18-b.

      DO 101 I = 1,16
      DO 101 J = 1,16
  101 A(I,J) = B(I,J) + C(I,J)
      DO 102 I = 1,16
      DO 102 J = 1,16
  102 D(I,J) = B(I,J)/2

The resulting BLI structure is shown in Figure 24-a. We have two
localities. The first includes segments A, B, and C and lasts for 768
memory references. The second locality includes segments B and D and
lasts for 512 memory references. If the loop of Program 18-a were in one
of the programs in our collection, it would have been vertically
distributed as shown below:

Program 18-c.

      DO 10 IP = 1,2
      ILB = 1 + (IP-1)*8
      IUB = IP*8
      DO 10 JP = 1,2
      JLB = 1 + (JP-1)*8
      JUB = JP*8
      DO 101 I = ILB,IUB
      DO 101 J = JLB,JUB
  101 A(I,J) = B(I,J) + C(I,J)
      DO 102 I = ILB,IUB
      DO 102 J = JLB,JUB
  102 D(I,J) = B(I,J)/2
   10 CONTINUE

The BLI's of this program are shown in Figure 24-b. It can easily be
seen that the localities for the segmented case in Figure 24-a can be
found from those in Figure 24-b by lumping into one locality all the BLI's
which have the members A, B, C. In this way we get the locality in Figure
24-a with the members A, B, and C and with the duration 192*4 = 768.
Similarly we get a locality of duration 128*4 = 512 and with the
members B and D by simply lumping in one locality all the BLI's of Figure
24-b in which these arrays are referenced.
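
The lumping step can be sketched in a few lines of Python. The sketch does not run Batson's BLI algorithms; it simply walks the loop structure of Program 18-c, records the set of arrays and the number of array references of each inner loop nest, and then adds up the durations of the intervals that share the same members. The function names are hypothetical and the counting of one reference per array occurrence is our simplification.

```python
from collections import defaultdict


def program_18c_intervals(n_blocks=2, block=8):
    """Return (members, references) for each inner loop nest of Program 18-c."""
    intervals = []
    for ip in range(1, n_blocks + 1):
        ilb, iub = 1 + (ip - 1) * block, ip * block
        for jp in range(1, n_blocks + 1):
            jlb, jub = 1 + (jp - 1) * block, jp * block
            iterations = (iub - ilb + 1) * (jub - jlb + 1)
            # Loop nest 101 references A, B and C once per iteration.
            intervals.append((frozenset("ABC"), 3 * iterations))
            # Loop nest 102 references B and D once per iteration.
            intervals.append((frozenset("BD"), 2 * iterations))
    return intervals


def lump_by_members(intervals):
    """Add up the durations of all intervals that have the same members."""
    totals = defaultdict(int)
    for members, references in intervals:
        totals[members] += references
    return dict(totals)


localities = lump_by_members(program_18c_intervals())
```

Each of the four blocks contributes 192 references to the locality {A, B, C} and 128 references to the locality {B, D}, so the lumped durations are 768 and 512, as in Figure 24-a.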

Figure 25 shows the characteristics of localities for the
transformed programs. In the 11 programs a total of 756 121 references
were made. Of these, 753 859 references were made when the programs were
executing within localities. These make up 99.7% of the total number of
references. Part of the remaining 0.3% of the references were made outside
loops. The other part of the 0.3% can be attributed to the fact that the
BLI method is not exact in finding the duration of a locality. As shown in
the figure, more than 48% of the references were made while the transformed
programs were executing within localities of size 2 or less. More than 97%
of the references were made within localities of size 5 or less.

In Figure 25 we also show data for our untransformed programs
and Batson's programs [BATS76a]. For Batson's programs the figure shows
the distribution of array references on level one BLI's of different
sizes. For our untransformed programs the data represent the distribution
of array references on BLI's which correspond to outermost loops.
If we accept the argument that the data of our untransformed programs
and Batson's data do not represent very different things, then one can
deduce from the figure that our programs are more local than Batson's.
While 45% of references are issued in level one BLI's of size less than
or equal to 5 segments in Batson's programs, almost 70% of the references
in our programs are made in loops with 5 or fewer arrays. Thus, as was
mentioned earlier, our reported improvement results are on the
conservative side because untransformed programs can be less local.

One can get an intuitive idea about the improvement achieved
by our transformations by comparing the data of the original and
transformed programs in Figure 25. Because of the assumptions made when the
data were generated (one segment per array) the improvements which we see
here drastically underestimate the power of the transformations. The
power of the transformations will be more fairly seen when paged virtual
memory systems are discussed.

As was mentioned previously, in the transformed programs more
than 97% of the references are made in localities of size 5 or less.
For the untransformed programs only 85% of the references are made in
outermost loops with 14 or fewer arrays. While almost 50% of the
references in the transformed programs are made in localities of size 2 or
less, only 30% of the references in the original programs are made in
loops with 4 or fewer arrays.

4.1.2  Localities  in  Paged  Systems 

The general intuitive impression one gets from examining Figure
25 is that the locality of untransformed programs is not really that bad
under the assumptions of the previous section. Almost 80% of the
references are made in loops with 6 or fewer segments (arrays). More than
98% of the references are made in loops with 15 or fewer segments. Since
the number of segments in our programs varied between 6 and 57 with an
average of 25.3, their locality is good. One can arrive at similar
conclusions from examining the data representing Batson's programs.

Virtual memory systems, however, face serious problems
when they execute programs for which the assumptions of the previous
section do not hold. Batson's programs were selected from the daily work
load of the University of Virginia computing center. They were executed
on the B5500 computer which supports a variable segment size virtual
memory system. The segment size can take values between 1 and 1023 words.
Since, in his programs, there was a one-to-one correspondence between
array names and segment identifiers, none of the programs had an array
larger than 1023 words. Although in our programs there are many arrays
which are larger than 1023 words, we still assumed that each array will
occupy one segment when we generated the data of the previous section.
We were interested in investigating the BLI concept and in finding a
lower bound, in some sense, on the improvements achieved by our
transformation techniques.

When multi-segment or multi-page arrays are referenced in
programs, their degree of locality drops drastically. This is because,
in general, there is no one-to-one correspondence between the number of
array names referenced per iteration of a loop and the number of pages
referenced. In [ELSH74] it was shown that in a paged system the locality
of a matrix multiplication program which makes references to only 3 array
names can be improved drastically by using some rules in accessing the
elements of these multi-page arrays. Batson in [BATS76b] points out that
the implications of his measurements of program localities do not apply
to paged systems. We quote, "Thus it seems clear that major phases,
with relatively small activity sets, span the major part of the execution
epochs of programs. This phenomenon, otherwise known as locality of
reference, is the raison d'être for the successful operation of
symbolically-segmented virtual memory systems. Its implications for paged
virtual memory systems are less promising, since there is no correspondence
in general between pages and symbolic segments."

As we have mentioned in Chapters 2 and 3, our transformations
serve two purposes. First, they make all loops behave like elementary
loops for which the number of pages referenced is highly correlated to
the number of array names. Thus for transformed programs there will be a
one-to-one correspondence between array names and pages referenced.
Second, the transformations will reduce the cost of executing programs
in a paged system, namely the space-time cost, the number of page faults,
and the amount of memory allotment required.

Figure 26 supports our argument. We have generated the BLI's
of our 17 original programs and their transformed versions. Here we
assume a paged system with a page size of 64 words (256 bytes). For the
transformed programs the data in the figure represent the percentage of
array references made while the programs executed with locality sizes of
a particular number of pages or less. We got these data by careful
correlation of the generated BLI's to the source programs. For the
untransformed programs the data represent the percentage of references made
while the programs executed in BLI's of sizes equal to or less than a
particular number of pages. The BLI's correspond to outermost loops.

In the figure we see that for the transformed programs more than
71% of the 1 483 921 array references were made in localities of size 3
pages or less. 83% were made in localities of size 5 pages or less and
more than 97% of the references were made in localities of size 8 pages
or less. If we compare Figures 25 and 26 we find that the locality of
the transformed programs is comparable for both paged and segmented
systems. The only noticeable difference is that the percentage of
references made in localities of 3 pages is higher than the percentage made
in localities of 3 segments. This is because the results shown in Figure 26
include data from the six programs which we excluded from our experiments
in the previous section. The transformed versions of five of these
programs (CD, FLR, GE, MATMUL, and MATTRP) issue their references in
localities of 3 pages. Program LUD issues most of its references in
localities of size 6 or less.

For the untransformed programs, the results shown in Figures
25 and 26 are very different. While more than 80% of the references
are made in loops with 10 or fewer segments (array names), only 30% of
the references are made when programs execute loops that correspond to
BLI's of size 10 or fewer pages. In Figure 25 more than 98% of the
references are made in loops corresponding to BLI's of size 15 segments
or less. In Figure 26 only 92% of the references are made in loops that
correspond to BLI's of size 60 pages or less.

Thus it is clear that the locality of untransformed programs is
not as good in paged systems as it is under the assumptions of the
variable segment size systems. Moreover, Figure 26 shows that our
transformations have succeeded in establishing a one-to-one correspondence
between the number of array names referenced in a locality and the
number of pages referenced. Figure 26 also shows the appreciable reduction
in the size of program localities achieved by the transformations.

4.2 Measuring the Performance Improvement of Paged Virtual Memory
Systems - the Fixed Memory Allotment Case

In this section we discuss our experimental results which compare
the performance of a virtual memory computer when it executes
untransformed programs to its performance when it executes transformed
programs. We have measured the number of page faults generated as a
function of memory allotment for all our programs and their transformed
versions. The replacement algorithm used was the LRU algorithm. All these
page fault curves are included in the Appendix. These curves, as discussed
in Chapter 2, are relevant to measuring the performance of monoprogrammed
systems. To make comparisons for a multiprogrammed machine, we have
measured the space-time cost for the original and transformed programs
as a function of memory allotment. These curves are also included in
the Appendix.

In  Section  4.2.1  we  discuss  the  page  fault  curves  and  in 
Section  4.2.2  we  discuss  the  space-time  cost  curves. 

4.2.1   The  Page  Faults  vs.  Memory  Allotment  Results 

If  the  total  memory  requirement  of  a  program  is  less  than  or 
equal  to  the  physical  primary  memory  available  on  a  monoprogrammed  machine, 
then  there  is  really  no  advantage  to  using  a  virtual  memory  operating 
system  over  a  nonvirtual  memory  system  in  handling  the  memory  management 
problem  of  the  machine.   The  program  simply  gets  all  the  memory  it 
needs  under  both  systems.   The  virtual  memory  system,  however,  is 
superior  to  the  non-virtual  memory  system  when  programs  are  to  be  executed 
with  memory  needs  which  exceed  the  available  primary  memory  size  of 
the  machine.   In  the  non-virtual  memory  system  the  programmer  must 
manually  take  care  of  the  overlay  problem,  i.e.,  moving  parts  of  the  code 
and  data  of  his  program  between  primary  and  secondary  memories  during 
the  execution  of  his  program.   In  a  virtual  memory  machine,  however,  this 
is  done  automatically  by  the  OS. 

A paged virtual memory system moves parts of the code and data
of a program (called pages) between the different levels of a memory
hierarchy when page faults occur. Thus the amount of information
transferred between secondary and primary memory is proportional to the
number of page faults.

Here we are mainly interested in comparing the performance of two
monoprogrammed virtual memory machines. The first runs our original
untransformed programs and the second runs our transformed programs.
Both use the same replacement algorithm, LRU. How much better the
second machine does can be used as a measure of the power of our
transformation techniques. The comparison can be done in two ways. First,
the two machines can be given the same amount of primary memory and then
the number of page faults can be compared. In the second approach one
can ask for the same performance, goodness, or efficiency from both
machines and then compare the amount of primary memory which must be
installed on each machine to achieve this given level of performance. One
needs, however, to define a figure of merit to measure the level
of performance. It seems that the ratio of the number of distinct pages
to the number of page faults would be the appropriate figure of merit.
Thus if the total number of distinct pages referenced during the
execution of a program is DP, then the level of performance of a
monoprogrammed system with a given amount of primary memory, m, can be
defined as follows:

      Level of performance, LP(m) = DP/f(m) ≤ 1

where f(m) is the number of page faults with m page frames of primary
memory. Since every distinct page must be brought in at least once,
f(m) is never less than DP.
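
The two quantities used in this comparison can be sketched as follows. The code is a minimal illustration of LRU fault counting under demand paging with a fixed allotment of m page frames; it is not the simulator used for the experiments, and the function names and the short reference string in the usage note are our own.

```python
from collections import OrderedDict


def lru_page_faults(reference_string, m):
    """Count the page faults f(m) of a reference string under LRU
    replacement with m page frames (demand paging, empty initial memory)."""
    frames = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in reference_string:
        if page in frames:
            frames.move_to_end(page)        # page becomes most recently used
        else:
            faults += 1
            if len(frames) >= m:
                frames.popitem(last=False)  # evict the least recently used page
            frames[page] = True
    return faults


def level_of_performance(reference_string, m):
    """LP(m) = DP / f(m), where DP is the number of distinct pages."""
    dp = len(set(reference_string))
    return dp / lru_page_faults(reference_string, m)


example = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
```

For the string `example`, which references DP = 5 distinct pages, this gives f(3) = 10, f(4) = 8, and f(5) = 5, so LP rises from 0.5 at m = 3 to 1 at m = 5.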

Comparing the page faults as a function of memory allotment for
our two machines, the first executing the untransformed programs and the
second executing the transformed programs, can be done by examining the
page fault curves of the untransformed and transformed programs in the
Appendix. A full appreciation of the amount of improvement can only
be achieved by examining and commenting on the curves of each program
individually. We will not go into such a discussion, however, because
it would be very lengthy; we leave it to the reader to decide
how much time to spend studying the curves and drawing
conclusions. We will instead present in Table 10 an overview of the
improvement achieved for the memory range between 4 and 8 pages. We have
chosen this memory range because, as discussed in Section 4.1, programs
with good locality will spend most of their execution time in localities
of sizes between, say, 3 or 4 and 8 pages. Testing of the effectiveness
of the transformations is fairest when it is done in such a memory range.
Note that in general, with any given memory allotment, a transformed
program should never generate more page faults than the original program.
There are two exceptions to this general rule. In the first case, due
to scalar expansion, the transformed program will generate more page
faults at very low memory allotment (when it is thrashing) and also with
large memory allotment (because the number of distinct pages referenced
in the transformed program will be larger). For this case, the additional
page faults of the transformed program are not real. This increase would
not exist if we counted references to scalar variables in the original
program. In the second case, due to loop fusion in the nonbasic to basic
π-block transformation, the transformed program may generate more page
faults at low memory allotments. Let us remember that when transforming
nonbasic π-blocks, loop fusion is mainly used to reduce the number of
extra control instructions generated. This is done whether the fused
loops reference similar pages or not. This increase in the number of
page faults will disappear, however, as slightly more memory is allotted.

Table 10. The Ratio of the Number of Page Faults of the Original to
the Transformed Programs for 4 ≤ m ≤ 8.

                                     Ratio
Program     f(4)/f_t(4)  f(5)/f_t(5)  f(6)/f_t(6)  f(7)/f_t(7)  f(8)/f_t(8)

ADVECT          3.58        19.26        36.78        33.71        34.40
BASE           29.56        54.01        55.11        55.73        55.26
BIGEN          36.93         1.42         1.00         1.00         1.00
CD             39.27        37.12        24.59        25.13        13.95
DISPERSE        5.78         4.40         5.06         5.13         5.11
FIELD          22.59        26.75        30.47        35.24        38.72
FLR             6.65         5.76         4.32         2.83         1.00
FOURTR          1.10        16.33        40.25        41.64        46.22
GE             53.04        53.18        50.00        50.86        45.92
INIT            5.87         5.78         3.46         3.46         3.46
LUD              .1           .12        34.97        31.12        20.44
MAIN            5.11         7.32         6.85         7.10         6.72
MAMOCO          1.07         1.10         3.52         3.59         4.28
MATMUL         58.91        58.91        58.91        58.91        58.91
MATTRP          6.88         5.48         5.48         3.52         3.52
PAPUAL          1.31         1.31         1.31         1.31         7.87
TWOWAY          2.90         2.69         7.55        16.02        17.46

MIN.             .1           .12         1.00         1.00         1.00
AVG.           16.51        17.70        21.74        22.13        21.42
MED.            5.87         5.78         7.55        16.02        13.95
MAX.           58.91        58.91        58.91        58.91        58.91

In Table 10, f is the page fault function of the untransformed
programs, f_t is the page fault function of the transformed programs, and
m is the memory allotment in pages. The table shows the ratio f/f_t
at memory allotments of 4, 5, 6, 7, and 8 pages. We note that the
average improvements at these page allotments are 16.51, 17.70, 21.74,
22.13, and 21.42 respectively. The average of these averages is 19.9.
With a memory allotment of 4 pages, the factor of improvement is greater
than 36 for 4 programs, between 22 and 30 for 2 programs, between 5 and
7 for 5 programs, between 2 and 4 for 2 programs, and less than 2 for 4
programs. With 6 pages, the factor of improvement is greater than 30
for 7 programs, around 25 for 1 program, between 4 and 8 for 5 programs,
between 3 and 4 for 2 programs, and less than 2 for 2 programs. Finally,
with 8 pages, the factor of improvement is greater than 34 for 6 programs,
between 13 and 21 for 3 programs, between 4 and 8 for 4 programs,
between 3 and 4 for 2 programs, and there is no improvement for 2
programs. We note that only one transformed program produced more page
faults than the original program at m = 4 and 5. This happened in program
LUD because we used loop fusion while transforming its nonbasic π-block
into a basic one. For memory allotments of 6 pages or more, however, the
transformed program produces fewer page faults.
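
The summary rows of Table 10 can be recomputed from the 17 program rows. The sketch below does this with the Python standard library; the ratios are copied from the table as printed (including the entries .1 and .12 for LUD), and the variable names are ours. The recomputed column averages agree with the printed AVG. row to within 0.01.

```python
from statistics import mean, median

# Ratios f(m)/f_t(m) for m = 4, 5, 6, 7, 8, copied from Table 10 as printed.
RATIOS = {
    "ADVECT":   [3.58, 19.26, 36.78, 33.71, 34.40],
    "BASE":     [29.56, 54.01, 55.11, 55.73, 55.26],
    "BIGEN":    [36.93, 1.42, 1.00, 1.00, 1.00],
    "CD":       [39.27, 37.12, 24.59, 25.13, 13.95],
    "DISPERSE": [5.78, 4.40, 5.06, 5.13, 5.11],
    "FIELD":    [22.59, 26.75, 30.47, 35.24, 38.72],
    "FLR":      [6.65, 5.76, 4.32, 2.83, 1.00],
    "FOURTR":   [1.10, 16.33, 40.25, 41.64, 46.22],
    "GE":       [53.04, 53.18, 50.00, 50.86, 45.92],
    "INIT":     [5.87, 5.78, 3.46, 3.46, 3.46],
    "LUD":      [0.1, 0.12, 34.97, 31.12, 20.44],
    "MAIN":     [5.11, 7.32, 6.85, 7.10, 6.72],
    "MAMOCO":   [1.07, 1.10, 3.52, 3.59, 4.28],
    "MATMUL":   [58.91, 58.91, 58.91, 58.91, 58.91],
    "MATTRP":   [6.88, 5.48, 5.48, 3.52, 3.52],
    "PAPUAL":   [1.31, 1.31, 1.31, 1.31, 7.87],
    "TWOWAY":   [2.90, 2.69, 7.55, 16.02, 17.46],
}

# One tuple per memory allotment (m = 4, ..., 8).
columns = list(zip(*RATIOS.values()))

minima = [min(col) for col in columns]
averages = [mean(col) for col in columns]
medians = [median(col) for col in columns]
maxima = [max(col) for col in columns]
overall_average = mean(averages)

# Programs whose transformed version faults more than the original.
worse = {name: [m for m, r in zip(range(4, 9), row) if r < 1.0]
         for name, row in RATIOS.items() if min(row) < 1.0}
```

With these data `worse` contains only LUD, at m = 4 and m = 5, which matches the observation made above.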

We now use the second approach to measure the achieved
improvements. Namely, we will compare the amount of memory required in the
untransformed programs machine to the memory required in the transformed
programs machine when both operate at the same level of performance.
Examining the page fault curves in the Appendix one notes that for both
the transformed and untransformed programs the curves are monotonically
decreasing. For the untransformed programs, the drop in page faults as
memory allotment increases is rather gradual in most parts of the curves.
In some instances fast drops in faults do occur. The relative
magnitudes of such drops are rather small when they happen at small memory
allotment. If large sudden drops in faults are observed they usually
occur at large memory allotments. Eventually the curves will be
asymptotic to the absolute minimum number of page faults, DP.

The page fault curves of the transformed programs follow a much
more consistent pattern. All transformed programs encounter a steep drop
in page faults at some memory allotment between 4 and 8 pages. We will
call the points in the page fault curves where this happens the knee
points. The memory allotment at the knee point is denoted by m_kt, and
the page faults of the transformed program there will be f_t(m_kt). Beyond
the knee point, m > m_kt, the f_t curves approach their asymptotic values
with small slopes. Since page fault curves are in general not smooth
curves, i.e. the slopes change abruptly, we cannot choose a particular
slope to find the exact location of the knee point for each curve. For
example, saying that the knee point is the point at which the slope of the
curve is 135° would not work. Examining the space-time curves for the
transformed programs, we noticed that the memory allotment at the absolute
minimum space-time cost points can be used to identify the knee points in
the page fault curves. If we denote the memory allotment at the minimum
space-time cost point of a transformed program by M_ot, then we choose
to take m_kt = M_ot. Using this method of finding the value of m_kt was
successful in locating the steep drop regions in the page fault curves.
Although m_kt and M_ot have the same value for each transformed program,
we wish to use two symbols to emphasize the distinction between the
discussion of mono and multiprogrammed systems.
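
The knee-point rule just described can be sketched in a few lines. This
is only an illustration under stated assumptions: the space-time cost is
taken with the bias term removed, i.e. m * f_t(m) as defined in Section
4.2.2, and the function name knee_point and the sample fault curve are
hypothetical, not data from our programs.

```python
def knee_point(faults_t):
    """Return m_kt, the allotment m that minimizes m * f_t(m).

    faults_t maps a memory allotment m (page frames) to the page-fault
    count f_t(m) of a transformed program.  Ties go to the smaller m.
    """
    return min(sorted(faults_t), key=lambda m: m * faults_t[m])


# Hypothetical fault curve with a steep drop at m = 5 (made-up numbers).
f_t = {1: 9000, 2: 8000, 3: 7000, 4: 6500, 5: 300, 6: 280, 7: 270, 8: 260}
m_kt = knee_point(f_t)
print(m_kt)
```

For this made-up curve the minimum of m * f_t(m) falls exactly where the
steep drop occurs, which is the behavior the rule relies on.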

Let us now restate what we are trying to do. We want to compare
the memory allotment needed in the machine executing the untransformed
programs to the memory needed in the machine executing the transformed
programs while both machines operate at the same level of performance.
Here we have to decide on the levels of performance to be used in making
the comparisons. We will make two sets of comparisons. In the first set
we take the performance level achieved by the transformed programs at
m = m_kt to be the comparison level. In other words we will compare m_kt
and m_ckt, where f_t(m_kt) ≤ f(m_ckt) (the inequality is used because the
f curves are not continuous curves). Thus, m_ckt is the memory allotment
needed by the untransformed program to generate no less than f_t(m_kt)
page faults. This type of comparison shows the value of the
transformations for each program individually because m_kt is in general
different for different programs. In the second set of comparisons we are
more interested in the improvements across the programs from the OS point
of view. In other words, if the machine of the transformed programs has
only 4 page frames to be allotted to each of these programs, then it is
interesting to know the number of page frames needed by the untransformed
programs machine to achieve the same level of performance. We will do
this comparison with 4, 6, and 8 page frames.

Table 11. Memory Requirements of Transformed and Original Programs at
Similar Performance Levels - the Transformed Programs Knee
Points Level.

Program    m_kt    m_kt/DP   LP_t(m_kt)   m_ckt   m_ckt/DP   m_ckt/m_kt

ADVECT     6       .0265     .229         31      .137       5.17
BASE       5       .0167     .817         38      .127       7.60
BIGEN      2       .0052     .877         5       .013       2.50
CD         3       .1429     .231         11      .537       3.67
DISPERSE   3       .0041     .762         60      .082       20.00
FIELD      8       .1538     .853         18      .346       2.25
FLR        2       .0870     .821         7       .304       3.5
FOURTR     6       .0468     .133         65      .508       10.8
GE         3       .0833     .229         35      .972       11.67
INIT       1       .0041     1.00         64      .267       64
LUD        6       .1667     .231         22      .611       3.67
MAIN       5       .0252     .240         26      .131       5.2
MAMOCO     6       .0068     .360         30      .034       5
MATMUL     3       .0400     .273         34      .453       11.3
MATTRP     2       .0800     1.00         9       .360       4.5
PAPUAL     8       .0056     .989         176     .124       22
TWOWAY     8       .0283     .115         56      .199       7

MIN.       1       .0041     .115         5       .013       2.25
AVG.       4.53    .0542     .541         40.53   .309       11.20
MED.       5       .0283     .360         31      .261       5.2
MAX.       8       .1538     1.00         176     .972       22

Table 11 shows the results of the first set of comparisons.
We note that m_kt ranged from 1 to 8 with an average of 4.53. The median
is 5. Thus on the average only .0542 of the virtual space of programs
needs  to  be  in  primary  memory  to  achieve  an  average  LP  of  .541.   The 
average number of page frames needed in the untransformed programs
machine to achieve identical levels of performance is 40.53. This number
varies between a minimum of 5 and a maximum of 176 page frames. The
median is 31. On the average 11.20 times more page frames are needed in
the untransformed programs machine than are needed in the transformed
programs machine. Note that the paged machine running untransformed
programs needs on the average .309 of the virtual space of programs in
primary memory. This factor of 3.24 reduction in the memory needed, which
was achieved by the introduction of paging to nonpaged systems, is
surpassed by the reduction in the memory needed in the transformed
programs machine relative to the untransformed programs machine (an
average of 11.20 compared to an average of 3.24), where both machines are
paged.

Tables 12, 13, and 14 show our second set of comparisons. In
these tables we use m_c4, m_c6, and m_c8 to denote the memory allotments
needed by the untransformed programs to generate no less than f_t(4),
f_t(6), and f_t(8) page faults respectively. With 4 page frames the
transformed programs machine will have on the average an LP of .382 with
a median of .244. In Table 12 we note that the untransformed programs
machine needs on the average 29.35 page frames to achieve the same level
of performance, with a median of 12.00 page frames. Thus the transformed
programs machine achieves an average factor of 7.34 reduction in the
memory required to achieve this level of performance (the median is
3.00). Note that on the average, the untransformed programs machine is
achieving a factor of 26.25 saving in primary memory compared to an
unpaged machine. The transformed programs machine is achieving a factor
of 74.40.

Table 12. Memory Requirements of Transformed and Original Programs at
Similar Performance Levels - the Transformed Programs 4
Pages Level.

Program    LP_t(4)   m_c4    m_c4/4    DP/4      DP/m_c4

ADVECT     .022      14      3.50      56.50     16.14
BASE       .444      38      9.50      75.00     7.89
BIGEN      1.00      6       1.50      96.25     64.17
CD         .244      11      2.75      5.25      1.91
DISPERSE   .763      60      15.00     183.50    12.23
FIELD      .369      11      2.75      13.00     4.73
FLR        .885      7       1.75      5.75      3.29
FOURTR     .003      5       1.25      32.00     25.6
GE         .234      35      8.75      9.00      1.03
INIT       1.00      64      16.00     61.25     3.83
LUD        .0001     1       .25       9.00      36
MAIN       .071      14      3.50      49.50     14.14
MAMOCO     .0101     4       1.00      218.75    218.75
MATMUL     .272      34      8.50      18.75     2.21
MATTRP     1.00      9       2.25      6.25      2.78
PAPUAL     .165      174     43.50     354.50    8.15
TWOWAY     .0102     12      3.00      70.50     23.50

MIN.       .0001     4.00    1.00      5.25      1.03
AVG.       .382      29.35   7.34      74.40     26.25
MED.       .244      12.00   3.00      49.50     8.15
MAX.       1.00      174     43.50     354.50    218.75

Table 13. Memory Requirements of Transformed and Original Programs at
Similar Performance Levels - the Transformed Programs 6
Pages Level.

Program    LP_t(6)   m_c6    m_c6/6    DP/6      DP/m_c6

ADVECT     .229      31      5.17      37.67     7.29
BASE       .833      38      6.33      50.00     7.89
BIGEN      1.00      6       1.00      64.17     64.17
CD         .280      11      1.83      3.50      1.97
DISPERSE   .885      64      10.67     122.33    11.47
FIELD      .571      14      2.33      8.67      3.71
FLR        .962      7       1.17      3.83      3.29
FOURTR     .133      65      10.83     21.33     1.97
GE         .246      35      5.83      6.00      1.03
INIT       1.00      64      10.67     40.83     3.83
LUD        .237      22      3.67      6.00      1.64
MAIN       .281      28      4.67      33.00     7.07
MAMOCO     .367      30      5.00      145.83    29.17
MATMUL     .272      34      5.67      12.50     2.21
MATTRP     1.00      9       1.50      4.17      2.78
PAPUAL     .165      174     29.00     236.33    8.15
TWOWAY     .039      17      2.83      47.00     16.59

MIN.       .039      6.00    1.00      3.50      1.03
AVG.       .499      38.18   6.36      49.52     10.25
MED.       .281      30.00   5.00      33.00     3.83
MAX.       1.00      174     29.00     236.33    64.17

Table 14. Memory Requirements of the Transformed and Original Programs
at Similar Performance Levels - the Transformed Programs 8
Pages Level.

Program    LP_t(8)   m_c8    m_c8/8    DP/8      DP/m_c8

ADVECT     .245      31      3.88      28.25     7.29
BASE       .855      38      4.75      37.50     7.89
BIGEN      1.00      8       1.00      48.13     48.13
CD         .349      11      1.38      2.63      1.91
DISPERSE   .893      64      8.00      91.75     11.47
FIELD      .855      19      2.38      6.50      2.74
FLR        1.00      8       1.00      2.88      2.88
FOURTR     .155      65      8.13      16.00     1.97
GE         .275      35      4.38      4.50      1.03
INIT       1.00      64      8.00      30.63     3.83
LUD        .234      22      2.75      4.50      1.64
MAIN       .313      38      4.75      24.75     5.21
MAMOCO     .469      30      3.75      109.38    29.17
MATMUL     .272      34      4.25      9.38      2.21
MATTRP     1.00      9       1.13      3.13      2.78
PAPUAL     .99       176     22        177.25    8.06
TWOWAY     .115      59      7.38      35.25     4.78

MIN.       .115      8       1.00      2.63      1.03
AVG.       .592      41.82   5.23      37.20     8.47
MED.       .469      34.00   4.25      28.25     3.83
MAX.       1.00      176     8.73      177.25    48.73

Table 13 shows similar data when the transformed programs
machine allots 6 page frames to all programs. The average LP is .499
(the median is .281). The untransformed programs machine needs on the
average 38.18 pages to achieve this level of performance, which is an
average factor of 6.36 more than the memory needed by the transformed
programs machine. On the average, the untransformed programs machine
is achieving a factor of 10.25 savings in primary memory (compared to a
nonpaged machine) while the transformed programs machine is achieving a
factor of 49.52.

Table 14 shows the data when 8 page frames are allotted to all
the transformed programs. The average LP is .592 (.469 median). The
average memory needed by the untransformed programs is 41.82 page frames,
which is an average factor of 5.23 more page frames than 8.

Thus from Tables 10 through 14 it is clear that with few page
frames (4 to 8) the transformed programs have a much lower rate of page
faulting (on the average a factor of 19.9 lower). To achieve similar
levels of page faulting, the untransformed programs need on the average
a factor of 5.23 to 7.34 more memory (on the average 29.35 to 41.82 page
frames compared with 4 to 8 page frames for the transformed programs).

4.2.2  The Space-Time Cost vs. Memory Allotment Results

As discussed in Chapter 2, the throughput of a multiprogrammed
machine is inversely proportional to the average space-time cost of
execution of programs. Thus the concern here is to reduce the space-time
cost of programs. Moreover, one would like to reduce the amount of
memory allotted to each program because this will improve the degree of
multiprogramming.

In this section we compare the space-time cost of executing
untransformed and transformed programs. Here we assume that the OS uses
the local LRU replacement algorithm and a fixed memory allotment policy.
In other words, when a program is executed it is assigned a fixed amount
of memory. When this program generates a page fault the OS will replace,
if necessary, one of the pages of the same program. In later sections
of this chapter we will discuss the implications of our results when the
OS uses different memory management strategies.
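
The local LRU policy with a fixed allotment assumed above can be
sketched as follows. This is a minimal illustration, not the simulator
used for our measurements; the function name lru_faults and the short
reference string are hypothetical.

```python
from collections import OrderedDict


def lru_faults(trace, m):
    """Count page faults for a page-reference trace under LRU with m frames."""
    frames = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in trace:
        if page in frames:
            frames.move_to_end(page)        # hit: now the most recently used
        else:
            faults += 1
            if len(frames) == m:
                frames.popitem(last=False)  # evict the least recently used page
            frames[page] = True
    return faults


# Hypothetical reference string over 4 distinct pages.
trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
curve = {m: lru_faults(trace, m) for m in range(1, 5)}
print(curve)
```

Evaluating the function for every m gives one point of the page fault
curve f(m) per allotment; with m equal to the number of distinct pages
the count drops to DP, the minimum possible.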

Traditionally people have used the number of memory references
made by a program to measure the time spent by the CPU to execute the
program. If we denote this number by R, then the space-time cost of
executing a program under our assumptions is given by:

     Space-Time Cost = m * (R + f(m) * T)                      (4.1)

where m is the number of page frames allotted to the program, f(m) is
the number of page faults, and T is the average page fault service time
(in memory references). With the same m the space-time cost of the
transformed version of the program is given by:

     (Space-Time Cost)_t = m * (R + f_t(m) * T)                (4.2)
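
Equations 4.1 and 4.2 can be evaluated directly. The sketch below uses
made-up values for m, R, T and the two fault counts (they are
assumptions for illustration only) and checks that the difference
between the two costs at the same m does not depend on R.

```python
def space_time_cost(m, R, faults, T):
    """Equation 4.1 / 4.2: m page frames held for R + faults * T time units."""
    return m * (R + faults * T)


# Hypothetical values: 6 frames, 1e6 references, fault service = 5000 references.
m, R, T = 6, 1_000_000, 5_000
f_m, f_t_m = 400, 20          # made-up fault counts f(m) and f_t(m)

total = space_time_cost(m, R, f_m, T)
total_t = space_time_cost(m, R, f_t_m, T)

bias = m * R                  # term common to both equations
difference = total - total_t  # equals m * T * (f_m - f_t_m), independent of R
print(total, total_t, difference)
```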

We note that equations 4.1 and 4.2 have a common term m*R. If
we plot the curves representing these equations (versus memory
allotment) then this term is a common bias to both curves. The bias term
of the space-time cost of a program is independent of its degree of
locality. The locality of the program affects only the non-bias term.
Thus to compare the improvement in the locality of programs one needs to
compare only the non-bias space-time costs of the original and
transformed programs. This is in some way analogous to measuring the
voltage gain of an amplifier by the ratio of the AC output voltage to the
AC input voltage.

In the Appendix we show the space-time cost curves for our
programs after removing the bias terms. These curves are also independent
of the value of T, the page fault service time. We have normalized
these curves by making T equal to one unit time, i.e., one unit of the
space-time cost is equal to a page frame-page fault service time. Thus
the curves represent m*f(m) and m*f_t(m) for the original and transformed
programs respectively. We denote these two functions by ST(m) and ST_t(m).
Note that the difference between ST(m_1) and ST_t(m_2) is equal to the
difference between the total values of the space-time costs when
m_1 = m_2. However, if m_1 > m_2 then ST(m_1) - ST_t(m_2) is less than the
difference between the total values of the space-time cost. This is
because the bias term, m*R, increases as m is increased and hence it will
be greater for ST(m_1) than for ST_t(m_2). Thus the comparisons which we
will make shortly are on the conservative side (we will be comparing
ST(m_1) and ST_t(m_2) with either m_1 = m_2 or m_1 > m_2). In other words
our results would have been better if we plotted the total values of the
space-time cost functions. In the rest of this thesis, unless otherwise
specified, we use the term space-time cost to mean the total space-time
cost minus the bias term. Thus for the original programs the space-time
cost will be given by the ST(m) function and for the transformed programs
by the ST_t(m) function.
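
The claim that the comparison is conservative can be written out in one
step. In the following, C and C_t are symbols introduced here only for
illustration; they stand for the total space-time costs of equations 4.1
and 4.2, with m_1 frames given to the original program and m_2 to the
transformed one.

```latex
\begin{aligned}
C(m_1) - C_t(m_2)
  &= m_1\,[R + T f(m_1)] - m_2\,[R + T f_t(m_2)] \\
  &= (m_1 - m_2)\,R + T\,[\,m_1 f(m_1) - m_2 f_t(m_2)\,] \\
  &= (m_1 - m_2)\,R + T\,[\,ST(m_1) - ST_t(m_2)\,]
\end{aligned}
```

With T normalized to one, the total difference exceeds ST(m_1) - ST_t(m_2)
by (m_1 - m_2) * R, which is zero when m_1 = m_2 and positive when
m_1 > m_2.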

Both the ST and ST_t curves have absolute minimums. We will use
M_o to denote the memory allotment at the minimum point of the ST curve.
Similarly we use M_ot to denote the memory allotment at the minimum point
of the ST_t curve. Table 15 shows the M_o's for all our programs. We note
that M_o ranges between 1 and 67 with an average of 24.8 and a median of
24. There are 6 programs with M_o < 10, 7 programs with M_o > 30, and 4
programs with 10 ≤ M_o ≤ 30. In each of these three sets of programs M_o
is spread over the range of the set. In the first set M_o takes the
values 1, 1, 6, 6, 8, and 9. In the second set the values are 13, 20, 24,
and 28. In the third set the values are 31, 32, 36, 39, 41, 60, and 67.

Thus the first important observation we make is that the M_o's of
the original programs are well scattered over a wide range.

Another important observation which we make is that the ST
curves are not well behaved for m < M_o (see the Appendix). For some parts
of this memory range ST increases with m; for others it decreases.
Moreover, sudden jumps in the value of ST are often encountered. In other
words the ST curves wiggle, going up and down for m < M_o. For m > M_o
the ST functions are rather linearly increasing with m. Since M_o is
scattered over a wide range, it is impossible to choose a narrow band of
memory allotment in which all programs will run efficiently, i.e. with
ST values close to ST(M_o).

In Table 15 we also show the ratios M_o/DP and ST(M_o)/DP^2, where
DP is the number of distinct pages referenced. These are intended to
give a feeling for the potential advantage that paged virtual memory
machines have over non-virtual memory machines. If a program is allotted
a number of page frames equal to its M_o, then on the average it will be
using only .303 of the memory it needs in a non-virtual memory machine
and its space-time cost will be only .388 of the cost in the non-virtual
memory machine.

Table 15. Characteristics of the Minimum Space-Time Cost Points of
the Original Programs.

Program    M_o     M_o/DP    ST(M_o)/DP^2

ADVECT     32      .1416     .2218
BASE       39      .1300     .1300
BIGEN      6       .0156     .0156
CD         13      .6190     .6485
DISPERSE   1       .0013     .0186
FIELD      20      .3846     .4215
FLR        8       .3478     .3478
FOURTR     67      .5234     .6583
GE         36      1.000     1.000
INIT       6       .0244     .0846
LUD        24      .6667     .6587
MAIN       28      .1414     .4900
MAMOCO     31      .0354     .0357
MATMUL     41      .5467     .5467
MATTRP     9       .3600     .3600
PAPUAL     1       .0007     .0167
TWOWAY     60      .2127     .9431

MIN.       1       .0007     .0156
AVG.       24.8    .303      .388
MAX.       67      1.0       1.0
MED.       24      .2127     .3600

The space-time cost curves of the transformed programs have a
much better behavior. The minimum points in the ST_t curves occur at
memory allotments which fall in a much narrower band. Table 16 shows
the M_ot's of our programs. We note that all the transformed programs
have 1 ≤ M_ot ≤ 8. There are 3 programs with M_ot = 8, 4 with M_ot = 6, 2
with M_ot = 5, 4 with M_ot = 3, 3 with M_ot = 2, and one program with
M_ot = 1. The average M_ot is 4.53 and the median is 5. The implications
of the difference in the range of M_o and M_ot and in the behavior of the
ST and ST_t curves will be discussed shortly.

In Table 16 we also show M_ot/DP and ST_t(M_ot)/DP^2. On the average,
when a transformed program is allotted a number of page frames equal to
its M_ot then it will be using .0542 of its virtual space (which is the
same as the virtual space of the untransformed program) and it will be
costing only .1822 of its cost on a non-virtual memory machine.

Table 17 compares the optimum ST and ST_t points. On the average
an untransformed program needs 5.66 times more primary memory to achieve
its minimum space-time cost. Moreover, the minimum cost of an
untransformed program is on the average 4.04 times more than the minimum
cost of the transformed program. Note that if the untransformed program
were allotted M_ot page frames then it would cost (on the average) 29.84
times more than the transformed program.

Although comparing the optimum ST and ST_t points does serve the
purpose of showing the effectiveness of our transformations in improving
the behavior of programs and reducing their execution costs, it is still
more interesting to make comparisons under more practical assumptions.
The point is that an OS has no means of determining the values of M_o or
M_ot, and hence we cannot expect an untransformed program to run with M_o
page frames or a transformed program to run with M_ot page frames. Thus
the comparison at the optimum ST and ST_t points is probably only of
academic, theoretical interest. Although we do not wish at this point to
discuss any particular existing OS's, we want to make some comparisons
under assumptions which are closer to what happens in the real world.

Table 16. Characteristics of the Minimum Space-Time Cost Points
of the Transformed Programs.

Program    M_ot    M_ot/DP   ST_t(M_ot)/DP^2

ADVECT     6       .0265     .1157
BASE       5       .0167     .0203
BIGEN      2       .0052     .0059
CD         3       .1429     .6190
DISPERSE   3       .0041     .0054
FIELD      8       .1538     .1804
FLR        2       .0870     .1059
FOURTR     6       .0468     .5315
GE         3       .0833     .3634
INIT       1       .0041     .0041
LUD        6       .1667     .722
MAIN       5       .0252     .1052
MAMOCO     6       .0068     .0190
MATMUL     3       .0400     .1467
MATTRP     2       .0800     .080
PAPUAL     8       .0056     .0057
TWOWAY     8       .0283     .2467

MIN.       1       .0041     .0041
AVG.       4.53    .0542     .1822
MAX.       8       .1667     .722
MED.       5       .0283     .1059

Table 17. Comparing the Minimum Space-Time Cost Points of the
Original and Transformed Programs.

Program    M_o/M_ot   ST(M_ot)/ST_t(M_ot)   ST(M_o)/ST_t(M_ot)

ADVECT     5.3        36.18                 1.917
BASE       7.8        54.00                 6.376
BIGEN      3          40.49                 2.361
CD         4.3        46.32                 1.047
DISPERSE   .3         9.41                  3.477
FIELD      2.5        38.72                 2.336
FLR        4          7.5                   3.286
FOURTR     11.17      40.25                 1.872
GE         12         53.73                 2.751
INIT       6          42.29                 20.74
LUD        4          34.97                 .949
MAIN       5.6        7.32                  4.656
MAMOCO     5.2        3.52                  1.881
MATMUL     13.7       58.97                 3.727
MATTRP     4.5        7.72                  4.5
PAPUAL     .125       7.87                  2.923
TWOWAY     7.5        17.46                 3.837

MIN.       .125       3.52                  1.05
AVG.       5.66       29.84                 4.04
MAX.       13.7       58.97                 20.74
MED.       5.2        36.78                 2.92

We will make two sets of comparisons. In the first set we
compare ST to ST_t when both the transformed and untransformed programs
are allocated similar memory allotments (4 ≤ m ≤ 8). This type of
comparison will show us the reduction of the space-time cost which our
transformations achieve if the OS uses the policy of allotting a small
fixed number of page frames for all programs. In the second set of
comparisons we show that on the average, the cost of a transformed
program when allotted a number of page frames in the range 4 to 8 is much
less (an order of magnitude) than the cost of the untransformed program
even if it is allotted a number of page frames from a much larger range
(12 ≤ m ≤ 48). Here we will be comparing ST_t at m = 4, 6, and 8 to ST at
memory allotments in the range 12 ≤ m ≤ 48 with an increment of 4 page
frames.

Since at a fixed memory allotment, m = m_a, we have:

     ST(m_a)/ST_t(m_a) = m_a*f(m_a) / (m_a*f_t(m_a)) = f(m_a)/f_t(m_a)

then the results of comparing ST to ST_t at similar memory allotments in
the range 4 ≤ m ≤ 8 are identical to those shown in Table 10. Thus
all our previous discussion about the improvements in page faults for
this memory range applies directly to the improvements achieved in the
space-time cost. Hence, on the average the transformed programs will
have 19.9 times less space-time cost than the untransformed programs when
all programs are assigned a fixed memory allotment in the range 4 to 8
page frames.

Tables 18, 19, and 20 show our second set of comparisons. In
Table 18-a we show for all our programs the ratio ST(m)/ST_t(4), where
12 ≤ m ≤ 48. Note that we do not make the comparison for a program at
any m which is greater than DP of the program. We observe that for most
programs and for most memory allotments we have ST_t(4) < ST(m). This
is not true for program ADVECT with 32 ≤ m ≤ 48. This is because for
ADVECT M_o = 32 and M_ot = 6. When some more memory is given to the
transformed version of ADVECT (6 or 8 page frames) ST_t will be less than
ST(m) for any 12 ≤ m ≤ 48 (Tables 19-a and 20-a). Similar remarks apply
to program MAMOCO. In Table 20-a we note that programs CD and LUD are
the only two programs for which ST_t(8) is greater than ST(m) for some m,
12 ≤ m ≤ 48. The ratio ST/ST_t improves as the transformed versions of
these two programs are given fewer pages. This is because M_ot for CD is
3 and for LUD is 6. From Tables 18-a, 19-a, and 20-a it seems that an OS
can use the simple rule of allocating 4 pages to the transformed programs
with relatively small DP (say less than 100 or 75 page frames) and 8
page frames to those with larger DP's. In this case the transformed
programs will (in almost all cases) cost less to execute than the original
programs no matter how much memory is assigned to the untransformed
programs. (Note that it is not our purpose here to determine the exact
values of such numbers as 4 page frames for programs with DP < 100,
otherwise 8 page frames. More programs and more detailed studies need to
be done in order to determine such numbers. However, using statistics
available from large collections of Fortran programs and arguments
about the number of statements in a π-block and the number of operands
per statement, we are inclined to believe that our numbers are close to
being accurate.)

Tables 18-b, 19-b, and 20-b give some statistics about Tables
18-a, 19-a, and 20-a respectively. With 4 page frames, a transformed
program will have on the average between 6.09 and 10.8 times less
space-time cost than the untransformed program when the latter is
executed with a memory allotment in the range 12 ≤ m ≤ 48. The memory
reduction is between a factor of 3 and 12 with an average of 7.5 and a
median of 7.5. Note that the median of the reduction in the space-time
cost ranges between 3.54 and 8.30 with an average of 5.32. The average of
the averages of the improvement in the space-time cost is 8.81.

In  Table  19-b,  with  6  page  frames  the  average  improvement  in  the 
space-time  cost  ranges  between  8.94  and  12.88  with  an  average  of  11.49. 
The  median  of  the  improvement  ranges  between  4.42  and  7.78  with  an 
average  of  6.31.   The  reduction  in  memory  ranges  between  a  factor  of 
2.00  and  8.00  with  an  average  and  a  median  of  5.00. 

In  Table  20-b  the  transformed  programs  are  assigned  8  pages. 
The  average  reduction  of  the  space-time  cost  ranges  between  8.39  and 
13.68  with  an  average  of  11.73.   The  median  of  the  improvement  ranges 
between  4.54  and  7.45  with  an  average  of  6.03.   The  reduction  in  memory 
ranges  between  1.50  and  6.00  with  an  average  and  a  median  of  3.75. 

From Tables 18-b, 19-b and 20-b one can say that when transformed
programs are executed with memory allotments in the range 4 to 8
pages they have less space-time cost than the untransformed programs
by an average factor of 10.68 (this is the average of 8.81, 11.49 and
11.73). To achieve this improvement, a transformed program will be
executing with a memory which is on the average a factor of 5.42 less
than the memory allotted to the untransformed program. Thus in a
multiprogramming system our program transformations can result
potentially in an order of magnitude improvement in the throughput with
an increase in the degree of multiprogramming of more than a factor of 5.
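
The two summary factors quoted above follow from the per-table averages
reported for Tables 18-b, 19-b, and 20-b. The short check below simply
repeats that arithmetic; the variable names are ours.

```python
from statistics import mean

# Average space-time cost improvement with 4, 6, and 8 page frames
# (Tables 18-b, 19-b, 20-b).
cost_improvement = [8.81, 11.49, 11.73]
# Average memory reduction factor for the same three cases.
memory_reduction = [7.5, 5.0, 3.75]

avg_cost = round(mean(cost_improvement), 2)
avg_memory = round(mean(memory_reduction), 2)
print(avg_cost, avg_memory)
```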

4.3  Measuring the Performance Improvement of Paged Virtual Memory
Systems - the Variable Memory Allotment Case

Most existing virtual memory multiprogrammed systems use memory
management policies that vary the memory allotted to a program during
its execution. Here we choose the working set policy to represent
variable memory allotment policies [DENN68]. Other policies are
variations of and approximations to the working set policy. Our interest
is in finding the effect of our transformations on the space-time cost of
executing programs under the working set memory management policy.

Several studies have shown that variable memory allotment
policies are superior to fixed memory allotment policies like the LRU
[CHU72], [COFF72], [DENN75]. The main reason behind the superiority of
the variable memory allotment policies is that the main memory
requirement of a program may change drastically during its execution.
While fixed memory allotment policies assign to a program the same amount
of memory during its entire execution time, variable memory allotment
policies try to adapt the memory allotted to a program to the changing
size of its locality sets.

The working set policy keeps in memory the pages referenced during
the previous τ references. This set of pages is called the working set
and is denoted at time t by W(t,τ), where τ is the window size. The size
of the working set at time t is denoted by w(t,τ). From the results
of our experiments reported in Section 4.1, it is clear that the sizes
of the locality sets of a transformed program change much less than
those of an untransformed program. Hence it is interesting to see
whether the working set policy is any better than the LRU policy for the
transformed programs. For untransformed programs, enough previous work
seems to have been done to show that variable memory allotment policies
are better, and more work along these lines would add little. Thus our
interest is to compare the space-time cost of the transformed programs
under the LRU policy and the working set policy (WS).
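
The definition of w(t,τ) above can be illustrated with a minimal sketch.
It assumes that the window covers the τ most recent references including
the one made at time t; the text does not state whether the current
reference is counted, so this convention, the function name, and the
sample reference string are all assumptions made for illustration.

```python
def working_set_sizes(trace, tau):
    """Return [w(1, tau), ..., w(R, tau)] for a page reference trace."""
    last_use = {}   # page -> time of its most recent reference
    sizes = []
    for t, page in enumerate(trace, start=1):
        last_use[page] = t
        # Keep only pages referenced in the window (t - tau, t].
        expired = [p for p, used in last_use.items() if used <= t - tau]
        for p in expired:
            del last_use[p]
        sizes.append(len(last_use))
    return sizes


if __name__ == "__main__":
    trace = [1, 2, 1, 3, 3, 3, 4, 1]   # hypothetical reference string
    print(working_set_sizes(trace, tau=3))
```

The sketch rescans the resident pages at every reference, which is
adequate for short traces but not meant as an efficient implementation.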

Under the LRU policy one can plot the space-time cost as a
function of memory allotment. Under the WS policy, however, the memory
allotted to a program, i.e. its working set size w(t,τ), varies during
its execution. Thus in order to make a comparison to the space-time
cost under LRU, one needs to calculate the average memory allotted to the
program during its execution using the WS policy. With a given window
size τ, a program trace of length R references will generate f_w(τ) page
faults. Let w_i(t_i,τ) be the working set size when the ith page fault
occurs, 1 ≤ i ≤ f_w(τ). Then if we denote the page fault service time by
T, the average memory allotted to the program is given by

\[
M(\tau) = \frac{\sum_{t=1}^{R} w(t,\tau) + T \sum_{i=1}^{f_w(\tau)} w_i(t_i,\tau)}{R + T\, f_w(\tau)}
\]

By varying τ one is supposed to get different M(τ) and ST_w(τ), the
space-time cost under WS, and hence make a plot of ST_w(τ) versus M(τ)
which can then be compared to the space-time cost curve under LRU.

When collecting data for the ST_w(τ) curves we found that several
programs exhibited anomalous behavior under WS. Recently Franklin, Graham,
and Gupta have discovered, by experimentation, anomalies with the page
fault frequency replacement algorithm [FRAN78]. In the same paper they
pointed out that for some reference strings and some τ's the WS policy
can also have anomalous behavior. They called these anomalies the
parameter (τ)-real memory and real memory-fault rate anomalies. In the
paper a short reference string was constructed to illustrate the anomalies
with the WS policy.

These are the same anomalies that we found experimentally for
some of our transformed programs. Namely, the parameter τ of the working
set policy did not have a consistent relation to the average real
memory allotted to a program. One expects that the average memory
allotment should be a nondecreasing function of τ. In other words, given
τ_1 and τ_2, if τ_2 > τ_1 then it is expected that M(τ_2) ≥ M(τ_1). Similarly
one expects the number of page faults generated under WS to be
nonincreasing with the average allotted memory, i.e. if M(τ_2) > M(τ_1) then
it is expected that f_w(τ_2) ≤ f_w(τ_1). That the WS policy should
possess these properties is essential to be able to control the performance
of a multiprogrammed system by changing the parameter τ. As it is put
in [FRAN78], "...Load control is attempted by varying the paging algorithm
parameter. A load control based on an anomalous performance measure may
be unstable because a change of given sign in the parameter need not
produce changes of corresponding sign in the controlled variable."

For several of our programs we noticed that for some τ_2 > τ_1
we get M(τ_2) < M(τ_1). This is the parameter-average real memory
allotment anomaly. Moreover, for M(τ_1) > M(τ_2) we noticed that
f_w(τ_1) > f_w(τ_2). This is the average real memory allotment-page
fault anomaly.
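
The two anomalies can be checked mechanically once M(τ) and f_w(τ) are
known for a series of window sizes. The following is a minimal sketch,
not the measurement program used for this thesis. It assumes that a
fault occurs when the referenced page is not among the previous τ
references, and that w_i(t_i,τ) is the working set size just after the
faulting reference; the function names and the sample measurements are
hypothetical.

```python
from itertools import combinations


def ws_run(trace, tau, T):
    """One scan of the trace: return (M(tau), f_w(tau))."""
    last_use = {}                       # page -> time of last reference
    sum_w = sum_w_fault = faults = 0
    for t, page in enumerate(trace, start=1):
        fault = page not in last_use    # page left the window or is new
        last_use[page] = t
        for p in [p for p, used in last_use.items() if used <= t - tau]:
            del last_use[p]
        w = len(last_use)               # w(t, tau)
        sum_w += w
        if fault:
            faults += 1
            sum_w_fault += w            # w_i(t_i, tau)
    space_time = sum_w + T * sum_w_fault
    return space_time / (len(trace) + T * faults), faults


def find_anomalies(points):
    """points: (tau, M, faults) tuples. Return the offending tau pairs."""
    param_memory, memory_fault = [], []
    for (ta, ma, fa), (tb, mb, fb) in combinations(sorted(points), 2):
        if mb < ma:                         # tau_b > tau_a but M dropped
            param_memory.append((ta, tb))
        if (ma - mb) * (fa - fb) > 0:       # more memory and more faults
            memory_fault.append((ta, tb))
    return param_memory, memory_fault


def scan(trace, taus, T):
    """One pass over the trace per window size, as described in the text."""
    return find_anomalies([(tau, *ws_run(trace, tau, T)) for tau in taus])


if __name__ == "__main__":
    # Hypothetical measurements (tau, M, faults), not data from Table 21.
    sample = [(2, 1.8, 900), (3, 2.4, 700), (4, 2.1, 650), (5, 2.9, 400)]
    print(find_anomalies(sample))
```

In the hypothetical sample, raising τ from 3 to 4 lowers the average
allotment, and the run with the larger allotment also has more faults,
so the pair (3, 4) is reported for both anomalies.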

To find the average real memory allotment we had to choose a
value for T, the page fault service time. For a page size of 64 words,
we have chosen to use three different values of T: 32 references, 320
references, and 3200 references (1/2 page size, 5 page sizes, and 50
page sizes). The 3200 value seems to reflect a 64 word page fault
service time between disc and primary memory. The 32 value seems to
reflect a 64 word page fault service time between an interleaved
primary memory and a fast cache memory. Page fault service times between
CCD's and primary memory seem to fall between these two extremes [JULI78].
Since our main aim was to compare the space-time cost of the
transformed programs under LRU and WS, we have chosen values of τ, the
window size, in different ranges and with different increments so as to
get M(τ) in the relevant range of the LRU space-time curve for each
program. Generally speaking we used 2 < τ < 8 with an increment of 1 to
give us M(τ) in the range 1 < M(τ) < 5 and we used τ > 16 by increments
of 8, 32, 64, or 128 to give us M(τ) > 5. The selection of the initial
value of τ and its increment was tuned in every program to cover the
range of M(τ) of interest.

Table 21 shows the anomalous behavior of WS which we discovered
in 5 of our 17 transformed programs. M_1(τ), M_2(τ), and M_3(τ) are the
average allotted memory with the three values of the page transfer time
used: 32, 320, and 3200 references respectively. We notice that there
is a significant difference between f_tw(τ_1) and f_tw(τ_2). Thus depending
on the page fault service time, when the value of τ is increased from


en 
c 


o 

3 


X 

0 

CO 

• 

)-< 

en 

ao  E3 

o 

M 

U 

Cu 

OJ 

-o 

T3 

c 

OJ 

3 

p 

H 

0 

o 

■+4 

•H 

09 

> 

C 

03 

n) 

A 

M 

<D 

H 

CQ 

CM 

0) 

H 


CM 


CO 

g 


g 


CM 

H 

CM 

g 


CM 

g 


g 


CM 


182 


C* 

vt 


CM 


CO 
O 


CO 


CO 


CM 


m 


CO 
CO 


CM 


00 


\£> 


O 
O 


CM 

ON 


CM 


CM 
00 


CM 


00 


CM 


cr. 


oo 


CO 


St 

vO 


CM 


O 
st 


CO 


v£> 


u 

Q 

00 

w 

hJ 

o 

00. 

w 

u 

«3 

o 

M 

Pm 

PQ 

u 

U-i 

CT\ 


CTi 


CO 


St 
O 


CM 


CO 


00 


v£> 


st 


O 


00 
00 


ON 

m 


vO 

CO 

vO 

vO 

CO 

o> 

r-« 

00 

o 

00 

00 
st 


m 


st 

CM 

CO 

st 

00 

r^ 

O^ 

CO 

O 

r^ 

co 

i-4 

O 

00 

rH 

st 

.H 

st 

CO 

00 

o 

v£> 

CM 

CO 

vO 

o> 

CM 

O 

o> 

r-i 

iH 

vO 

o 

v£> 


00 


183 


to make the drop in the space-time integral proportionally greater than
the drop in time. Thus the average memory allotment will be decreased
rather than increased. In general if τ is increased from τ_1 to τ_2, then
in order for the anomaly to exist we must have:

\[
\frac{\sum_{t=1}^{R} w(t,\tau_1) + T \sum_{i=1}^{f_w(\tau_1)} w_i(t_i,\tau_1)}{R + T\, f_w(\tau_1)}
>
\frac{\sum_{t=1}^{R} w(t,\tau_2) + T \sum_{i=1}^{f_w(\tau_2)} w_i(t_i,\tau_2)}{R + T\, f_w(\tau_2)}
\]

Thus the existence of the anomaly depends on the program, τ_1, τ_2, and T.
We do not see an obvious way of explaining the dependence of the anomaly
on each individual one of these factors. The four factors interact to
produce the anomaly. In [FRAN78] an argument was presented to support
a theory that when the anomaly occurs for a given program, τ_1, and τ_2,
then there exists a crossover value of T = T_c such that the anomaly will
occur for all T > T_c. Our experiments have shown that this theory is
not valid. For example in programs CD and FIELD the anomaly occurs for
T = 32 and T = 320 but it does not occur for T = 3200.
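
That the inequality above can hold for a bounded range of T only, rather
than for all T beyond a crossover value, can be seen by evaluating both
sides for fixed sums. The sketch below uses purely hypothetical sums,
chosen for illustration and not taken from programs CD or FIELD; the
function and variable names are likewise assumptions.

```python
def average_memory(sum_w, sum_w_fault, faults, R, T):
    """M = (sum of w(t) + T * sum of w_i at faults) / (R + T * faults)."""
    return (sum_w + T * sum_w_fault) / (R + T * faults)


R = 10_000  # hypothetical trace length

# Hypothetical totals for two window sizes tau_1 < tau_2.
RUN_TAU1 = dict(sum_w=30_000, sum_w_fault=5_000, faults=1_000)
RUN_TAU2 = dict(sum_w=32_000, sum_w_fault=240, faults=40)


def anomaly(T):
    """True if M(tau_1) > M(tau_2) although tau_2 > tau_1."""
    m1 = average_memory(R=R, T=T, **RUN_TAU1)
    m2 = average_memory(R=R, T=T, **RUN_TAU2)
    return m1 > m2


if __name__ == "__main__":
    for T in (0, 32, 320, 3200):
        print(T, anomaly(T))
```

With these numbers the anomaly appears for T = 32 and T = 320 and
disappears again for T = 3200, because the sign of M(τ_1) - M(τ_2) is
governed by a quadratic in T.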

For all our transformed programs we noted that the anomaly
either does not exist or occurs at values of M(τ) which are less than
M_ot, the memory allotment at the minimum space-time point under LRU. We
found that for all those programs which are anomaly free there was no
difference between the space-time cost under LRU and WS in any memory
range and for the three values chosen for T. For the 5 programs which
exhibited the anomalous behavior, there was no difference between the
space-time cost under LRU and WS for memory allotments greater than M_ot.
For memory allotments less than M_ot the anomaly existed and no
comparison can really be made. Note that when we say there was no difference
between the cost under LRU and WS we mean that one cannot really draw
two different curves to represent the LRU and WS space-time cost
functions. In Figures 27, 28, and 29 we show the space-time cost for two
programs which have the anomaly (CD and BASE) and for one program which
is anomaly free (MATMUL). We will not show curves for any more programs
because they do not reveal any additional interesting information.

Because of our observation that the anomaly occurred at values
of memory allotment less than M_ot (which might be interpreted by some
people to mean that the anomaly only shows for some programs when they
are thrashing, whatever the definition of thrashing might be) we did
some more experimentation to see whether this is always true. We generated
the space-time cost functions under the WS policy for 7 of our
untransformed programs, namely ADVECT, BASE, BIGEN, DISPERSE, FOURTR, INIT,
and PAPUAL. The anomaly showed in 3 of these programs: INIT, DISPERSE, and
FOURTR. For program INIT the anomaly occurred at memory allotments
below and above M_o (for INIT M_o = 6). For program DISPERSE the anomaly
occurred at memory allotments greater than M_o = 1. For the FOURTR
program the anomaly occurred at memory allotments less than M_o = 67. As a
matter of fact we did not check whether it also occurs at allotments
greater than 67. (Remember that these experiments are very costly because
the trace has to be scanned once for every value of τ. We could not find in
the literature any algorithm for calculating the real average memory
allotments for different τ's in one scan of the trace. Moreover, from
our own investigation of this matter we reached the conclusion that one
needs to save so much information when going through the trace to
calculate the real average memory for different τ's that it is probably
cheaper and much simpler to go through the trace several times. To
locate anomalies one ideally needs to start at τ = 1 and increase it by
increments of 1. This is really a prohibitive expense even for short
traces. Most probably, this is the reason why the working set anomaly
was not discovered for more than ten years after the introduction of
the working set policy [DENN68]. Most probably this is also why nobody
else has tried to investigate this anomaly in real programs to date.)
Table 22 summarizes our findings concerning the anomalies in the
untransformed versions of programs INIT, DISPERSE, and FOURTR.

Figure 28-a.  The Space-Time Cost of Program CD (Transformed), T = 32 References
Figure 28-b.  The Space-Time Cost of Program CD (Transformed), T = 320
Figure 28-c.  The Space-Time Cost of Program CD (Transformed), T = 3200
Figure 29-a.  The Space-Time Cost of Program MATMUL (Transformed), T = 32
Figure 29-b.  The Space-Time Cost of Program MATMUL (Transformed), T = 320
Figure 29-c.  The Space-Time Cost of Program MATMUL (Transformed), T = 3200

Note that in our previous conclusion, that there was no difference
between the space-time cost of executing a program under the LRU or the
WS policies, we are using the average behavior of the program under
WS to make the comparison. In fact it should be clear that the LRU policy
is a better policy for transformed programs. If one plots the memory
allotted to a transformed program as its execution progresses in real
time, the WS curve will have sharp peaks whenever the program changes
localities. The LRU curve, however, stays at the same level for the
entire execution time of the program. Although the sharp WS peaks are
usually short, they can still cause serious problems in a multiprogrammed
system. If no free page frames are available when such excessive demand
for memory occurs, other programs may be deactivated. In [SMIT76] the
seriousness of this problem is reduced by making the WS
policy more elaborate and introducing a second parameter for the policy.
Smith called his modified WS algorithm the "Damped Working Set Algorithm,
DWS." We will not discuss the DWS algorithm and refer the interested
reader to [SMIT76]. The point is that for transformed programs a simple
algorithm like LRU can achieve a level of performance which is as
good as the average performance achieved by a more elaborate algorithm
(more costly to implement) like WS, which suffers from anomalous
behavior for some programs and needs more tuning to avoid the serious
problem of the peaks in the memory allotments during the execution of a
program.
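
The peaks in the WS allotment at locality changes can be reproduced with
a small sketch. It assumes a synthetic trace made of two disjoint
four-page localities and a window of τ = 8 references; the trace, the
window value, and the function name are hypothetical and are not taken
from any of the measured programs.

```python
def working_set_sizes(trace, tau):
    """Return w(t, tau) for every reference, window = last tau references."""
    last_use = {}
    sizes = []
    for t, page in enumerate(trace, start=1):
        last_use[page] = t
        for p in [p for p, used in last_use.items() if used <= t - tau]:
            del last_use[p]
        sizes.append(len(last_use))
    return sizes


# Two hypothetical localities of four pages each, 200 references apiece.
locality_a = [0, 1, 2, 3] * 50
locality_b = [10, 11, 12, 13] * 50
sizes = working_set_sizes(locality_a + locality_b, tau=8)

steady = sizes[100]   # deep inside the first locality
peak = max(sizes)     # reached shortly after the locality change

if __name__ == "__main__":
    print("steady allotment:", steady, "peak allotment:", peak)
```

Inside either locality the working set holds four pages, but just after
the transition the window still covers the old locality, so the
allotment briefly doubles. A fixed LRU allotment of four frames would
stay at four throughout.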

4.4  Summary

The preliminary results presented in this chapter show that
our transformation techniques are very promising. The transformations
have succeeded in making programs behave better and cost less, and they
seem to abolish the need for fancy memory management policies. A simple,
easy algorithm like LRU seems to do very well.

The only point that needs clarification is how sensitive
our results and conclusions are to the page size which we have chosen, namely
256 bytes or 64 words. This clarification is necessary because most
existing virtual memory systems use a page size which is in many cases
at least 4 times as large as ours. On the other hand, the
sizes of the arrays referenced in the programs may increase. This leads
to increasing the loop limits in the program. From talking to some of
the sources of our programs we learned that the sizes of the arrays used
in their programs might easily grow by a factor of 10. In other cases
where we coded programs ourselves, like the Gaussian Elimination program,
we used an array of size 48x48. In many applications, civil engineering
for example, the size of the system of equations solved is much larger
than this. Hence we also need to discuss the effect of increasing the
array sizes on our results. We spend the rest of this section
investigating the effect of varying the page size and array sizes on the page
fault and space-time cost curves. The sensitivity of the rest of our
results (locality sizes etc.) follows similar lines.

First we discuss transformed programs. Loops in transformed
programs follow the ELM model. For a loop that follows the ELM, the
critical memory allotment m_o is O(# of different array names). Thus m_o
is independent of the array sizes and the page size (as long as the arrays
are multi-page arrays). With m ≥ m_o, the number of page faults f is
O(K), where K is the number of pages per array. For m < m_o, f is O(#
of words per array). When a loop is allotted m < m_o we will say that the
loop is thrashing. Thus, to simplify the discussion, if we assume only
one dimensional arrays of size N, then for m ≥ m_o, f is O(N/Z) and for
m < m_o, f is O(N), where Z is the page size (see Chapter 2 for more
details about these points).
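
The O(N/Z) versus O(N) behavior can be observed on a small LRU
simulation. The sketch below is not the ELM of Chapter 2; it assumes a
simple stand-in loop that touches element I of each of k one-dimensional
arrays in every iteration, for which one page frame per array name is
enough to avoid thrashing. The function names and parameter values are
hypothetical.

```python
from collections import OrderedDict


def lru_faults(trace, frames):
    """Count page faults of a page trace under LRU with a fixed allotment."""
    resident = OrderedDict()            # least recently used page first
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)   # evict the LRU page
            resident[page] = True
    return faults


def loop_trace(num_arrays, n, z):
    """Page trace of: DO I = 1,N touching A1(I), A2(I), ..., Ak(I)."""
    pages_per_array = -(-n // z)        # ceiling of n / z
    return [a * pages_per_array + i // z
            for i in range(n)
            for a in range(num_arrays)]


if __name__ == "__main__":
    k, n, z = 3, 256, 64                # hypothetical loop and page size
    trace = loop_trace(k, n, z)
    print("m = 3 frames:", lru_faults(trace, 3))   # one fault per page
    print("m = 2 frames:", lru_faults(trace, 2))   # one fault per word
```

With three frames each of the 12 pages faults once, i.e. k * N/Z faults.
With two frames the cyclic reference pattern defeats LRU and every one
of the k * N references faults.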

We  consider  different  possibilities.   In  the  first  case  let  us 
see  what  happens  when  the  page  size  Z  is  increased  without  increasing 
the  array  sizes,  or  N.   In  the  second  case  we  find  out  the  effects  of 
increasing  N  without  increasing  Z.   In  the  last  case  both  N  and  Z  are 
increased.   In  all  cases  we  are  interested  in  the  programs  as  long  as 
N  >  Z,  otherwise  their  memory  requirements  are  relatively  small  and  they 
are  not  of  concern  to  us. 

When the array sizes are not changed and the page size is
increased, then by extending our previous discussion from the behavior of
loops to the behavior of a transformed program, we do not expect M_ot to
change significantly and for most programs it will not change at all.
To see why M_ot is not expected to change, let us remember that M_ot is
the memory allotment at the minimum space-time cost of the program.
When Z is increased, the reduction in the number of page faults generated
by each loop in the program (when it is not thrashing) is proportional
to the reduction in the number of pages spanned by the arrays of the loop.
Thus although the m_o's of the individual loops are not expected to change,
the relative contribution of each loop to the total space-time cost might
change. This will happen if the relative changes in the number of page
faults generated by the loops are not the same. However, since most of
the transformed program's time is spent in localities (π-blocks) with
five array names or so, the changes in M_ot, if they ever occur, will be
very small. In other words M_ot for any of our programs will always be
less than 8 and mostly around 5, irrespective of the page size or the
array sizes.

Since the number of pages spanned by each array decreases as Z is
increased, DP, the number of distinct pages referenced by
each program, will decrease. Hence the asymptotic value of the page
fault curves of both the transformed and untransformed program will drop.
Thus the values of f_t(m) for m ≥ M_ot will decrease. For m < M_ot, f_t(m)
is not expected to change much because the program will be thrashing.
This is also true for f(m) of the untransformed program at m < M_o. Thus
in the memory range M_ot ≤ m < M_o our results will improve. This is
because, as mentioned previously, in this range f_t(m) is decreased while
f(m) will not drop much. We do expect, however, a drop in M_o which is
more appreciable than the change in M_ot. Note that for some
untransformed programs M_o might not change or might change only slightly,
depending on how well the program is behaving. Thus the general conclusion
is that, when the page size is increased, the difference between the f_t(m)
and f(m) curves in the region M_ot ≤ m < M_o is expected to increase (or at
least not to decrease) while the width of this region might in general
decrease. Similar remarks apply to the ST(m) and ST_t(m) curves. To
check the validity of our arguments we have changed the page size
(without changing the array sizes) and obtained the page fault and
space-time cost data for 4 of our programs: BIGEN, FIELD, MAMOCO, and TWOWAY.
The results were in agreement with our expectations. As an example we
show in Figures 30 and 31 the page fault and space-time cost curves with a
page size of 256 words for program MAMOCO and its transformed version.
We also show the curves for a page size of 64 words. Note that M_o has
dropped from 31 to 17. In BIGEN, with similar changes in page size
(from 64 to 256) M_o did not change. The untransformed program of BIGEN
is much better behaved than MAMOCO. Also, for BIGEN M_ot did not change
while in MAMOCO M_ot increased from 6 to 8 (though ST_t(6) and ST_t(8) for
Z = 64 are not very different). Note the increase in the improvement
in the page faults and space-time cost when Z was increased to 256.

The conclusion we reached in the previous paragraph is really
relevant to the validity of our results under the worst possible
conditions, namely Z increasing without any increase in the array sizes. A
more realistic approach would be to allow both Z and the array sizes to
increase. As we have indicated previously, the sizes of arrays can
easily grow by a factor of 10 for some of our programs. This is comparable
to, or even in many cases more than, the expected growth of the page size. If
the sizes of the arrays grow more than the page size, our results will
be improved, and depending on the program, the improvement can be drastic.
By an argument similar to the one we made previously, M_ot is not expected
to change much. M_o, however, will increase. Thus, the range of memory
allotment which is of concern to us (M_ot ≤ m < M_o) will be increased.
In this memory range the page faults of the untransformed program will
increase in a manner which is roughly proportional to the increase in
the number of words per array. The page faults of the transformed
program, however, will increase in a manner which is roughly proportional to
the increase in the number of pages per array. In other words if we
have only one dimensional arrays in a program, the page faults of the
untransformed program, f(m), in the range M_ot ≤ m < M_o are in the best
case O(N), while f_t(m) is O(N/Z). Thus if the array sizes grow faster than
the page size, the region of improvement (M_ot ≤ m < M_o) will increase and
the degree of improvement will increase. If the page size is increased
more than the increase in the array sizes, then we have the situation
discussed in the previous paragraph.

Figure 30.  The Page Faults Curves for Program MAMOCO
Figure 31.  The Space-Time Cost Curves for Program MAMOCO

What happens if both the page size and the array
sizes are increased such that the number of pages per array stays the
same? In this case it is easy to see that neither M_o nor M_ot will change.
Moreover, f_t(m) in the range M_ot ≤ m < M_o will not change. However, f(m)
in this range will increase in a manner which is roughly comparable to
the increase in the number of words per array. Hence our results will
be improved. We believe that this case, where both the array sizes and
the page size grow in a comparable way, represents the most realistic
situation as far as existing virtual memory machines and the programs
which cause problems for these machines are concerned.
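
The three growth cases can be compared with a back-of-the-envelope
model. The sketch below keeps only the leading terms quoted above for a
single one-dimensional array in the range M_ot ≤ m < M_o, namely N
faults for the untransformed program (its best case) and ⌈N/Z⌉ faults
for the transformed one. The constant factor of 1, the function name,
and the sizes used are assumptions for illustration, not measurements.

```python
from math import ceil


def model_faults(n, z):
    """Return (f, f_t): leading-term fault counts for one array of n words.

    f   ~ n          untransformed program, best case O(N)
    f_t ~ ceil(n/z)  transformed program, O(N/Z)
    """
    return n, ceil(n / z)


CASES = {
    "base: N = 1600, Z = 64": (1600, 64),
    "Z x4, N fixed": (1600, 256),
    "N x10, Z fixed": (16000, 64),
    "N and Z both x8": (12800, 512),
}

if __name__ == "__main__":
    for name, (n, z) in CASES.items():
        f, f_t = model_faults(n, z)
        print(f"{name:24s} f = {f:6d}  f_t = {f_t:4d}  f - f_t = {f - f_t}")
```

In this model the gap f - f_t never shrinks in any of the three cases,
and when N and Z grow together f_t stays fixed while f grows with N.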

To check our conclusions for this latter case we have changed
the page size and the array sizes of 6 of our programs such that the
number of pages per array stays unchanged. These programs are: CD, FLR,
GE, LUD, MATMUL, and MATTRP. Our experimental findings agreed precisely
with our expectations. As an example we show in Figures 32 and 33 the
curves for program MATMUL. For this matrix multiply program the page
sizes are 64 words and 512 words. In both cases each two-dimensional
array in the program spanned 25 pages. Thus DP in both cases is 75.
For Z = 64 the dimensions of the arrays were 40x40. When we increased Z
to 512 we chose the dimensions of the arrays to be 101x101. These
dimensions were chosen specifically to be identical to those used by
Elshoff for the same program in [ELSH74]. This is because we wanted to
compare our results in Figure 32 to the best results obtained by Elshoff
when he used all his rules to improve the locality of the same matrix
multiplication program. However, this choice of the array dimensions
reduces the improvement of our results as Z is changed from 64 to 512.
From this point of view it would have been fairer to choose the
dimensions to be 110x110. This is because with Z = 64 and array dimensions
of 40x40 all words of the 25 pages of each array are referenced
(remember we are using the submatrix storage scheme). With Z = 512 and 101x101
arrays only 79.7% of the words in the 25 pages of an array will be
referenced. With 110x110 arrays 94.5% of the words in the 25 pages of
an array will be referenced. Since f(m) for M_ot ≤ m < M_o increases with
the number of words referenced while f_t(m) in this range is dependent on
the number of pages referenced, changing the array dimensions from 101x101
to 110x110 would have left f_t(m) unchanged and would have increased f(m)
by more than 14.8% (94.5% - 79.7%).

Figure 32.  The Page Faults Curves for Program MATMUL

We note that, as expected, the curves of the transformed
program are identical for Z = 64 and Z = 512. For the untransformed
program the number of page faults and the space-time cost have increased
when Z was increased. The increase in Z is a factor of 8. For m < 10,
f(m) is increased by a factor of 6.17 (for 110x110 arrays the increase
in f(m) would be greater than 7.3). Thus, the increases in f(m) and Z
are comparable in this memory range. We note that the difference between
the f(m) curves decreases as the memory allotment is increased. For
m ≥ M_o = 41, f(m) is independent of the page size.
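
The utilization percentages and the 7.3 estimate quoted above follow
from simple arithmetic, reproduced in the sketch below. The only
assumption beyond the figures in the text is that f(m) scales linearly
with the number of words referenced, which is how the text argues; the
function and variable names are hypothetical.

```python
def page_utilization(rows, cols, pages, page_size):
    """Fraction of the words in an array's pages that the array occupies."""
    return rows * cols / (pages * page_size)


u_40 = page_utilization(40, 40, 25, 64)        # Z = 64, 40x40 arrays
u_101 = page_utilization(101, 101, 25, 512)    # Z = 512, 101x101 arrays
u_110 = page_utilization(110, 110, 25, 512)    # Z = 512, 110x110 arrays

measured_factor = 6.17   # increase in f(m) for m < 10, quoted in the text
# Assumed linear scaling of f(m) with the number of words referenced.
scaled_factor = measured_factor * (110 * 110) / (101 * 101)

if __name__ == "__main__":
    print(f"40x40:   {u_40:.1%}")
    print(f"101x101: {u_101:.1%}")
    print(f"110x110: {u_110:.1%}")
    print(f"estimated factor for 110x110 arrays: {scaled_factor:.2f}")
```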

The data for the Elshoff curves was obtained from [ELSH74] (in
this paper there is no data for m > 20 pages). We observe that our
original program produced fewer page faults than Elshoff's original
program (for 3 < m < 10 the reduction factor is 2 and for 12 < m < 20 it is
66.7). We have achieved this improvement simply by storing
multi-dimensional arrays using the submatrix storage scheme. Elshoff, however,
coded his program in PL/I, which stores multi-dimensional arrays by rows.
Comparing the curve of our transformed program to the curve of the program
using the combination of all Elshoff's rules, we note that our automatic
transformation techniques (combined with the submatrix storage scheme)
are as powerful as Elshoff's rules (for m < 16 our transformed program
produces even fewer page faults).

5.   CONCLUSIONS AND EXTENSIONS

We hope that this thesis has been successful in drawing the attention
of computer manufacturers and scientists to the fact that compilers
should use special transformations when compiling for virtual memory
computers. It is very frustrating to find that existing compilers do not
make any distinction between compiling for a virtual memory machine and
for a non-virtual memory machine.

Although  in  the  last  decade  a  tremendous  number  of  papers  have 
been  written  about  virtual  memory  systems,  the  behavior  and  control  of  these 
systems  are  still  not  well  understood.   We  believe  that  this  is  due  to  the 
approach  taken  by  many  researchers  in  which  programs  were  treated  as  black 
boxes  that  generate  reference  strings.   More  effort  needs  to  be  dedicated 
to  studying  what  is  in  these  boxes,  namely  the  programs  themselves.   In 
this  thesis  we  have  shown  that  programs,  as  written  by  people,  do  not 
behave  well  in  a  virtual  memory  environment.   We  have  also  shown  that  simple 
compiler  transformations  can  force  programs  to  behave  well  (and  hence  be 
easy  to  model  and  manage)  and  cost  less  to  be  executed. 

We would like to use the rest of this final chapter to suggest some
points for future research. We will discuss three main issues. First, we
discuss possible improvements of some of the transformations of Chapter
Three. Second, we will raise some questions concerning the implications of
our results for the memory hierarchy design problem. Third, we will point
out the importance of extending our techniques to non-numeric programs
(e.g., Cobol programs).

Of all the transformations presented in Chapter Three, the
nonbasic to basic π-block transformation seems to be the most costly. The
algorithm used in this transformation is simple. However, the number of
control instructions executed in the transformed program is increased
drastically (for program LUD the increase is almost an order of magnitude;
see Table 8). A more elaborate algorithm can be used to apply the page
indexing transformation to a nonbasic π-block without first transforming it
to a basic π-block. In what follows we illustrate this technique, the
nonbasic π-block breaking technique, by applying it to Program 16-a of
Section 3.5.3.

By definition, the statements of a nonbasic π-block fall at
different nest depth levels. The general idea here is to identify the
values of the index variables which cause the recurrence in the
π-block and solve the π-block for these values first. Then we will be
left with a basic π-block. Consider Program 16-a, which is repeated below.

Program 16-a.

      DO S2 I = 1,N
S1       B(I,1) = A(I,1) ** .5
         DO S2 J = 1,N
S2          A(I+1,J) = B(I,J) + C(I,J)

By examining the data dependences in this program we find that the
recurrence occurs when J = 1 (i.e., the dependence arcs going from S1 to S2
and from S2 to S1 are due to the fact that J takes the value 1; if
J never took the value 1 there would be no recurrence). Hence, this
nonbasic π-block can be divided into two basic ones as follows:

Program 16-e.

      DO S21 I = 1,N
S11      B(I,1) = A(I,1) ** .5
S21      A(I+1,1) = B(I,1) + C(I,1)
      DO S22 I = 1,N
         DO S22 J = 2,N
S22         A(I+1,J) = B(I,J) + C(I,J)

This program can now be page indexed as follows:

Program 16-f.

      DO S22 IP = 1, ⌈N/RZ⌉
         ILB = 1 + (IP-1)*RZ
         IUB = MIN(IP*RZ, N)
         DO S21 I = ILB, IUB
S11         B(I,1) = A(I,1) ** .5
S21         A(I+1,1) = B(I,1) + C(I,1)
         DO S22 JP = 1, ⌈N/RZ⌉
            JLB = MAX(2, (1 + (JP-1)*RZ))
            JUB = MIN(JP*RZ, N)
            DO S22 I = ILB, IUB
               DO S22 J = JLB, JUB
S22               A(I+1,J) = B(I,J) + C(I,J)
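
As an informal check on the recurrence breaking idea, the following Python sketch re-expresses Programs 16-a, 16-e, and 16-f and compares the values they leave in A and B. Python is used only for illustration. The array contents, the 1-based padding, and the helper names (`make_arrays`, `program_16a`, `program_16e`, `program_16f`, `all_agree`) are assumptions of this sketch and do not appear in the original programs.

```python
def make_arrays(n):
    """Return test arrays A, B, C using 1-based indexing (index 0 is padding)."""
    # A needs n + 1 usable rows because statement S2 writes A(I+1, J).
    a = [[float(i + j + 1) for j in range(n + 1)] for i in range(n + 2)]
    b = [[float(i * j + 1) for j in range(n + 1)] for i in range(n + 1)]
    c = [[float(i + 2 * j) for j in range(n + 1)] for i in range(n + 1)]
    return a, b, c


def program_16a(n):
    a, b, c = make_arrays(n)
    for i in range(1, n + 1):
        b[i][1] = a[i][1] ** 0.5                  # S1
        for j in range(1, n + 1):
            a[i + 1][j] = b[i][j] + c[i][j]       # S2
    return a, b


def program_16e(n):
    a, b, c = make_arrays(n)
    for i in range(1, n + 1):                     # J = 1 carries the recurrence
        b[i][1] = a[i][1] ** 0.5                  # S11
        a[i + 1][1] = b[i][1] + c[i][1]           # S21
    for i in range(1, n + 1):                     # J >= 2 has no recurrence
        for j in range(2, n + 1):
            a[i + 1][j] = b[i][j] + c[i][j]       # S22
    return a, b


def program_16f(n, rz):
    a, b, c = make_arrays(n)
    blocks = -(-n // rz)                          # ceiling of N / RZ
    for ip in range(1, blocks + 1):
        ilb = 1 + (ip - 1) * rz
        iub = min(ip * rz, n)
        for i in range(ilb, iub + 1):
            b[i][1] = a[i][1] ** 0.5              # S11
            a[i + 1][1] = b[i][1] + c[i][1]       # S21
        for jp in range(1, blocks + 1):
            jlb = max(2, 1 + (jp - 1) * rz)
            jub = min(jp * rz, n)
            for i in range(ilb, iub + 1):
                for j in range(jlb, jub + 1):
                    a[i + 1][j] = b[i][j] + c[i][j]   # S22
    return a, b


def all_agree(n, rz):
    """True if the broken and page-indexed forms match Program 16-a."""
    reference = program_16a(n)
    return program_16e(n) == reference and program_16f(n, rz) == reference
```

Calling `all_agree(7, 3)`, for example, compares the three forms for N = 7 and RZ = 3. Agreement on a few sizes is only a sanity check and not a proof that the transformation is valid in general.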

We have used this concept of breaking nonbasic recurrences in
programs CD, GE, and LUD. We obtained the same curves of page faults
and space-time cost versus memory allotment as before (for program LUD
we got better results here because loop fusion is not used as it was in the
nonbasic to basic transformation; M_ot is reduced from 6 to 3). Table 23
compares the number of instructions executed when using the recurrence
breaking technique and the nonbasic to basic π-block transformation. The
advantages of the recurrence breaking technique are obvious. However,
more work needs to be done to determine the complexity of this technique
and its implementation problems.
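
To make the comparison in Table 23 concrete, the short Python sketch below divides each transformed instruction count by the count for the original program. The counts are copied from Table 23. The dictionary layout and the function name `overhead_ratios` are illustrative assumptions.

```python
# Instruction counts copied from Table 23.
TABLE_23 = {
    "CD":  {"original": 234211, "nonbasic_to_basic": 2202748, "recurrence_breaking": 295547},
    "GE":  {"original": 494314, "nonbasic_to_basic": 1619039, "recurrence_breaking": 567741},
    "LUD": {"original": 507543, "nonbasic_to_basic": 2247035, "recurrence_breaking": 676576},
}


def overhead_ratios(counts):
    """Return (nonbasic-to-basic, recurrence-breaking) counts relative to the original."""
    base = counts["original"]
    return counts["nonbasic_to_basic"] / base, counts["recurrence_breaking"] / base


for name, counts in TABLE_23.items():
    to_basic, breaking = overhead_ratios(counts)
    print(f"{name}: nonbasic to basic x{to_basic:.2f}, recurrence breaking x{breaking:.2f}")
```

For these three programs the nonbasic to basic transformation executes roughly 3 to 9 times as many instructions as the original program, while recurrence breaking stays below about 1.4 times the original count.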

Another  transformation  technique  which  needs  further  investigation 
is  one  we  used  in  the  Fast  Fourier  Transform  program,  FOURTR.   Basically 
what  we  did  can  be  illustrated  by  the  following  example. 

Program 19-a.

      DO S I = 1,N1
         DO S J = I,N2,DELT
S           A(J) = B(J) + C(J)

Table 23. Comparing the Two Techniques of Transforming Nonbasic π-Blocks.

                        Number of Instructions Executed
           ---------------------------------------------------------------
                        Nonbasic to Basic        Recurrence Breaking
           Original     π-Block                  Transformation Used
Program    Program      Transformation Used
-------    --------     -------------------      -------------------
CD           234211                 2202748                   295547
GE           494314                 1619039                   567741
LUD          507543                 2247035                   676576

In this program, if DELT > N1 then its locality can be improved by
transforming it as follows (the mean time between references to the same
page will be smaller):

Program 19-b.

      DO S I = 1, ⌈(N2-1)/DELT⌉
         JLB = 1 + (I-1)*DELT
         JUB = N1 + (I-1)*DELT
         DO S J = JLB, JUB
S           A(J) = B(J) + C(J)
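
The effect of this reordering on the reference pattern can be sketched in Python as follows. The sketch lists the subscripts J in the order each program touches them and counts how often two consecutive references fall on different pages. The function names and the parameter `elems_per_page` are assumptions made for illustration. The check below covers only a case where N2 is a multiple of DELT; other cases are not examined here.

```python
def refs_19a(n1, n2, delt):
    """Subscripts J in the order Program 19-a references them."""
    return [j for i in range(1, n1 + 1) for j in range(i, n2 + 1, delt)]


def refs_19b(n1, n2, delt):
    """Subscripts J in the order Program 19-b references them."""
    blocks = -(-(n2 - 1) // delt)             # ceiling of (N2-1)/DELT
    refs = []
    for i in range(1, blocks + 1):
        jlb = 1 + (i - 1) * delt
        jub = n1 + (i - 1) * delt
        refs.extend(range(jlb, jub + 1))
    return refs


def page_switches(refs, elems_per_page):
    """Count consecutive references that land on different pages."""
    pages = [(j - 1) // elems_per_page for j in refs]
    return sum(1 for prev, cur in zip(pages, pages[1:]) if prev != cur)


a_refs = refs_19a(4, 32, 8)
b_refs = refs_19b(4, 32, 8)
print(sorted(a_refs) == sorted(b_refs))       # same elements are touched
print(page_switches(a_refs, 8), page_switches(b_refs, 8))
```

With N1 = 4, N2 = 32, DELT = 8, and an assumed 8 array elements per page, both programs touch the same 16 elements, but Program 19-a changes page on every one of its 15 steps while Program 19-b changes page only 3 times.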

We have chosen not to discuss this technique in Chapter Three because
we encountered this situation only once in the programs we examined.
More work needs to be done to investigate how important this case is and
to develop the needed general transformation algorithm.

Before leaving the subject of improving the transformations, we want
to mention that some of the rules we adopted in some transformations were
rather strict. For example, to fuse two NP's we required that their control
structures be identical. This rule does not have to be so rigid. Loops of
slightly different control structure can be fused if the difference in the
control structure is taken care of by appropriate statements (IF
statements, for example). Thus the loop fusion transformation might need
some tuning.
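
A minimal Python sketch of this relaxation is given below, assuming two loops whose bounds differ by one iteration and whose statements do not depend on each other. The loop bodies, array names, and function names are hypothetical and were chosen only to show the guarding IF. Whether such a fusion is legal in a real program still depends on its data dependences.

```python
def separate_loops(n, b, c):
    """Two loops with slightly different control structure (1..N and 2..N)."""
    a = [0.0] * (n + 1)
    d = [0.0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + c[i]
    for i in range(2, n + 1):
        d[i] = b[i] * c[i]
    return a, d


def fused_loop(n, b, c):
    """The same computation fused into one loop, with an IF covering the difference."""
    a = [0.0] * (n + 1)
    d = [0.0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + c[i]
        if i >= 2:                    # guard replaces the second loop's lower bound
            d[i] = b[i] * c[i]
    return a, d
```

The fused form references B(I) and C(I) once per iteration for both statements, at the price of one extra test per iteration.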

The second area which has great potential for further research
is investigating the implications of our results for the memory hierarchy
design problem. For example, large pages are currently favored
over small pages because of the page fault service time overhead. However,
the larger the page, the worse the internal fragmentation problem becomes
[DENN70]. Currently, with CCD technology, people are building smart
(expensive) controllers which reduce the latency time to zero. In [FULL78]
and [SITE78], a cheaper approach is suggested which cuts the average
latency time to about 0.1 of the rotation cycle. Thus it seems that the
latency problem of rotating paging devices is going to disappear one
way or another. Hence, the page fault service time will be reduced.
Since transformed programs have excellent behavior even with small page
sizes, the best page size needs to be reconsidered and re-evaluated.
If small page sizes prove to be better, as we expect, this will lead
to a considerable reduction in the amount of physical primary
memory needed in a machine.

This thesis invites an investigation of another important subject.
In the last few years research has been going on at the University of
Illinois to design transformations for enhancing the parallelism of ordinary
programs so that they execute efficiently on parallel machines. Not much
attention was given to the effect of these transformations on the memory
space requirements and I/O activities of programs. The challenging question
which we are raising here is: how can programs be transformed to run faster
on a parallel machine which is supervised by a virtual memory operating
system? When transforming programs for vector machines the goal is to
maximize the number of operations which can be executed simultaneously. The
larger the number of data items which can be processed simultaneously, the
higher the speedup achieved by a vector machine. In other words, parallel
and pipelined machines are most effective when they process long vectors.
This requires that these long vectors be accessible in main memory.
From a paging operating system point of view, however, the goal is to
minimize the space-time cost, the primary memory requirements, and the I/O
activity of programs. In serial machines the success of virtual memory
systems is based on the locality property, i.e., only a small portion (a
small number of pages) of the data (and code) of a program needs to be in
main memory at one time. The transformations presented in this thesis are
aimed at enhancing this locality property. Thus it seems that our virtual
memory enhancement transformations and the parallelism enhancement
transformations are at odds. The parallelism transformations assume that
all the elements of large arrays will be in main memory, while the virtual
memory transformations are designed to make programs execute with as
little data in main memory as possible! It would be interesting to find out
whether some compromise transformations can be designed to achieve both
goals: enhancing the parallelism and locality of programs.

Last, but not least, the design of transformations for improving
the locality of nonnumeric programs (Cobol programs, for example) is
another possible area for future research. This is important because the
majority of machine cycles in the world are spent on such nonnumerically
oriented calculations.

APPENDIX

In this appendix, we show the page faults and the space-time
cost curves for our untransformed and transformed programs. The
replacement algorithm used is the LRU algorithm and the page size
is 256 bytes. The space-time cost is measured in pages-page faults
(see Section 4.2.2).
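
For readers who want to reproduce curves of this kind, the following Python sketch counts page faults under LRU replacement for a given memory allotment. It is a generic textbook-style simulation and not the measurement tool used for this thesis. The function name `lru_page_faults`, the use of byte addresses as input, and the sample reference string are assumptions. The space-time cost is defined in Section 4.2.2 and is not reproduced here.

```python
from collections import OrderedDict


def lru_page_faults(addresses, frames, page_size=256):
    """Count page faults for a byte-address reference string under LRU."""
    resident = OrderedDict()          # pages in memory, least recently used first
    faults = 0
    for addr in addresses:
        page = addr // page_size
        if page in resident:
            resident.move_to_end(page)            # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)      # evict least recently used page
            resident[page] = None
    return faults


# Hypothetical reference string that touches pages 0, 1, 2, 0, 3, 0, 4.
sample = [p * 256 for p in (0, 1, 2, 0, 3, 0, 4)]
for frames in (1, 3):
    print(frames, lru_page_faults(sample, frames))
```

Running the function over a range of `frames` values for one reference string gives the points of a page fault curve like those plotted in the figures of this appendix.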
Figure 34-a.   The Page Faults Curves for Program ADVECT
[Plot: Space-Time Cost versus Pages of Real Memory; curve labeled "Original Program"]
Figure 34-b.   The Space-Time Cost Curves for Program ADVECT
Figure 35-a.   The Page Faults Curves for Program BASE
Figure 35-b.   The Space-Time Cost Curves for Program BASE
Figure 36-a.   The Page Faults Curves for Program BIGEN
Figure 36-b.   The Space-Time Cost Curves for Program BIGEN
Figure 37-a.   The Page Faults Curves for Program CD
Figure 37-b.   The Space-Time Cost Curves for Program CD


[Plot: Page Faults versus Pages of Real Memory; curve labeled "Original Program"]
Figure 38-a.   The Page Faults Curves for Program DISPERSE
Figure 38-b.   The Space-Time Cost Curves for Program DISPERSE
Figure 39-a.   The Page Faults Curves for Program FIELD
Figure 39-b.   The Space-Time Cost Curves for Program FIELD
Figure 40-a.   The Page Fault Curves for Program FLR
Figure 41-a.   The Page Faults Curves for Program FOURTR
Figure 42-a.   The Page Faults Curves for Program GE
Figure 43-a.   The Page Faults Curves for Program INIT
Figure 44-a.   The Page Faults Curves for Program LUD
Figure 44-b.   The Space-Time Cost Curves for Program LUD
Figure 45-a.   The Page Faults Curves for Program MAIN
Figure 46-a.   The Page Faults Curves for Program MAMOCO
Figure 48-a.   The Page Faults Curves for Program MATTRP